# A very hard classification problem
by: _Andrés M. Castillo_

*Universidad del Valle*

This notebook practices the most common preprocessing steps that you must carry out before building a classification model. The dataset comes from [Kaggle](https://www.kaggle.com/sayeera/classification). It had no description or open discussion at the time of writing.

The first part of the practice is to create fancy and meaningful charts that help in the data understanding process. As you may remember, this is one of the critical steps in the data mining process.

The second part of the practice is to create a classification model that predicts or explains the job satisfaction of a company's employees, based on the related information available in this dataset. 

To accomplish the challenge, the student must provide a very detailed and fancy description of the variables depending on their type, and some analysis of each variable against the target variable "JobSatisfaction". Use histograms, pie charts, scatter plots, box plots, or violin plots, depending on which is more appropriate.
<img src="visu.jpeg" alt="visualization" style="width: 600px;"/>

We start by importing the needed libraries.

In [None]:
# Uncomment if the libraries are not installed
#!conda install -c conda-forge ipywidgets -y
#!conda install -c anaconda graphviz python-graphviz -y
#!conda install -c conda-forge keras -y
#!conda install -c anaconda pydot -y
#!conda install -c anaconda seaborn -y

# import the numpy module
import numpy as np

# import the pandas module
import pandas as pd

# import the regular expressions module
import re

# import the matplotlib plotting module
import matplotlib.pyplot as plt

# import seaborn under the alias sns
import seaborn as sns

In [None]:
!pip install pandas

## Pandas

Pandas is a powerful toolset for data manipulation, and a very convenient way to start working with data in Python. Its main feature is the data frame structure. You will use pandas throughout this course, so please get familiar with the tool: (https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html)

## Open the dataset

Each time we open a dataset file, it is good practice to print the column names and the size of the dataset, just to get a first glance at the data.

In [None]:
data = pd.read_csv("./data/Classification.csv")
print(data.columns)
print(data.shape)
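
If the CSV file is not at hand, the same first-glance routine can be tried on a small in-memory table. The sketch below is only an illustration: the column names are borrowed from this notebook, but the three rows are invented, and `io.StringIO` stands in for the file path.

```python
import io

import pandas as pd

# Invented CSV content, standing in for ./data/Classification.csv
csv_text = """Age,Attrition,JobSatisfaction
41,Yes,4
49,No,2
37,Yes,3
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy.columns.tolist())  # column names
print(toy.shape)             # (number of rows, number of columns)
```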

Then it is useful to print a few rows to see what the data looks like. This can be done with the data frame method ``data.head(num_lines)``.

**Assignment 01**

Print the first 10 lines of the dataset

In [None]:
pd.set_option('display.max_columns', None) # Force to "print" all columns
# START CODE HERE
# Display the first 10 rows of the data
None
# END CODE HERE

Using this table, try to determine the data type of each column. Note that, as loaded, the columns hold only two kinds of values: numbers and strings. You can check the type of a value with the ``type(variable)`` function.

You can access the columns of a pandas data frame in a static or a dynamic way:
* `data.column_name`: Static form
* `data['column_name']`: Dynamic form

And you can access the elements of a list or array using bracket notation:
```
foo = [7, 3, -14]
foo[1]
```
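
As a small sketch of the two access forms, the frame below is invented for illustration, and "Monthly Income" is a made-up column name chosen because it contains a space. It suggests why the dynamic form is the more general one.

```python
import pandas as pd

# Invented two-column frame; "Monthly Income" is a hypothetical name with a space
toy = pd.DataFrame({"Age": [41, 49], "Monthly Income": [5993, 5130]})

# Static (attribute) access works when the name is a valid Python identifier
print(toy.Age[0])

# Dynamic (bracket) access works for any column name, including ones with spaces
print(toy["Monthly Income"][0])

# The dynamic form also accepts a name stored in a variable
column_name = "Age"
first_age = toy[column_name][0]
```

The dynamic form is the one to use inside loops, where the column name changes at each iteration.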

In [None]:
print(type(data.Age[0]))   # Print the type of the Age variable. 
print(type(data['Attrition'][0]))  # Print the type of the Attrition variable

**Assignment 2**

Store the type of each column in a Python list (`[]`) and print it.

**Hint**

* In Python you can traverse any iterable object using the `for ... in` syntax. For example, to print all the elements of a given list you simply do:

```
for value in [1, 6, 3, 7]:
    print(value)
```

* Remember that you got the column names previously
* You can access any column in a data frame using its name: `data['A']`

In [None]:
types = []
# START CODE HERE
None
# END CODE HERE
print(types)

**Expected output**
```
[<class 'numpy.int64'>, <class 'str'>, <class 'str'>, <class 'numpy.int64'>, <class 'str'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'str'>, <class 'numpy.int64'>, <class 'str'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'str'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'str'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.bool_'>]
```
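
The list above holds the Python type of the first element of each column. pandas also keeps one dtype per column, available through the `dtypes` attribute. The sketch below uses an invented frame, not the course dataset, to show the two views side by side.

```python
import pandas as pd

# Invented frame with one numeric and one string column
toy = pd.DataFrame({"Age": [41, 49], "Attrition": ["Yes", "No"]})

# Type of a single element of the string column
element_type = type(toy["Attrition"][0])
print(element_type)

# Per-column dtype as stored by pandas
print(toy.dtypes)
```

Depending on the pandas version, a string column is reported as `object` or as a dedicated string dtype, so the element type is often the easier one to read.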

## Data types for Data Mining

It is important to know that the previous split of the data into integers and strings is not really appropriate for most data mining techniques.

So now you must use your previous knowledge about data types, and your ability to understand data, to classify each attribute into one of the following classes:
* Categorical data, or factors
* Ordinal data, or levels
* Ratios, or numerical data
* Intervals: another kind of numerical data, without an absolute zero
* Text: used for long strings containing descriptions, or any text not representing a category

## Group the attributes by type

**Assignment 03**

Create three groups of attributes: real/numerical, ordinal, and categorical. In some cases, you will need to create a 4th group for all the text columns. 

Check: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

In [None]:
# START CODE HERE
numericColumns = [None]
factorColumns = [None]
levelColumns = [None]
# END CODE HERE

target = "JobSatisfaction"

Now we are going to convert each attribute to its corresponding data type using functions defined by pandas.
In the following cell, we will convert the data as follows:
* Attributes in `numericColumns` are converted with `pd.to_numeric`. This is not really necessary in Python, but it is better to be as explicit as you can.
* Attributes in `factorColumns` are converted to an unordered [CategoricalDtype](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.CategoricalDtype.html)
* Attributes in `levelColumns` are converted to an ordered CategoricalDtype
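
The difference between unordered and ordered categories can be sketched on a toy frame. The columns `color` and `size` and their values are invented for illustration and are not part of the dataset.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Invented columns, not taken from the dataset
toy = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["small", "large", "medium", "small"],
})

# Unordered categories: the values are labels with no ranking
color_type = CategoricalDtype(categories=["blue", "green", "red"], ordered=False)
toy["color"] = toy["color"].astype(color_type)

# Ordered categories: the list order defines small < medium < large
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
toy["size"] = toy["size"].astype(size_type)

# Order comparisons are only available for ordered categories
is_bigger_than_small = toy["size"] > "small"
print(is_bigger_than_small.tolist())
```

The order of the `categories` list is what pandas uses as the ranking, so it should be written from the lowest level to the highest.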

In [None]:
from pandas.api.types import CategoricalDtype

# Convert to numeric attributes
# Nothing to do in Python. Numeric is the default
for col in numericColumns:
    data[col] = pd.to_numeric(data[col], errors='coerce')
    
# loop to change each column to category type
for col in factorColumns:
    # START CODE HERE
    cat_type = None
    data[col] = None
    # END CODE HERE

# Convert to levels / ordinals
for col in levelColumns:
    # START CODE HERE
    cat_type = None
    data[col] = None
    # END CODE HERE

In [None]:
data['JobSatisfaction']

Print the attributes 'Attrition' and 'EnvironmentSatisfaction'. Note the difference in how the two columns are printed, and interpret it.

In [None]:
data['Attrition']

In [None]:
data['EnvironmentSatisfaction'].values

## Manipulating the data frame

**Assignment 04**
1. Create a new binary attribute from JobSatisfaction. Its value will be True if JobSatisfaction is higher than 2 and False otherwise.
2. Use EmployeeNumber as the row index of the data frame.
3. Once step 2 is done, delete the attributes EmployeeNumber and EmployeeCount.
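
The same kind of manipulation can be sketched on a toy frame. Everything here is invented for illustration: the columns `id`, `constant` and `score` are hypothetical, and `set_index` is just one possible route, not necessarily the one expected in the cell below.

```python
import pandas as pd

# Invented frame: "id" plays the role of a unique row identifier
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "constant": [1, 1, 1],
    "score": [1, 3, 4],
})

# Boolean attribute derived from a threshold on another column
toy["high_score"] = toy["score"] > 2

# Use the identifier as the row index; set_index drops the column by default
toy = toy.set_index("id")

# Remove a column that carries no information
toy = toy.drop(columns=["constant"])

print(toy)
```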

In [None]:
# Add a new attribute to try a binary classification first.
# It must be categorical, but booleans are effectively categorical

# START CODE HERE
# Create a new attribute called class from JobSatisfaction
data['class'] = None

# Use the EmployeeNumber column as the index of the data frame
data.index = None

# Delete the "EmployeeCount" and "EmployeeNumber" columns from the data frame
None
None

# END CODE HERE

In [None]:
data.head(5)

# Understanding our data

Numerical data is the preferred data type for most data mining techniques. For that reason, pandas comes with a built-in function to summarize numerical data. Use `data.describe()` to obtain a table with a statistical summary of your numerical data.

In [None]:
data.describe()
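
By default `describe()` only summarizes the numeric columns. As a sketch on an invented frame, passing `include=["category"]` asks pandas for a summary of the categorical columns instead: count, number of unique values, most frequent value and its frequency.

```python
import pandas as pd

# Invented frame with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "department": pd.Categorical(["Sales", "Sales", "HR", "Sales"]),
})

numeric_summary = toy.describe()                          # numeric columns only
categorical_summary = toy.describe(include=["category"])  # categorical columns only

print(numeric_summary)
print(categorical_summary)
```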

But that is not the only way to understand the data. Let's see what can be done in Python using **matplotlib** and **pandas**. In the following cell, we will create 2 pie charts to summarize the `JobSatisfaction` attribute.

In [None]:
# create a Figure object to hold the subplots
fig = plt.figure(figsize=(20, 10))
ax = fig.add_subplot(1,2,1)

data['JobSatisfaction'].value_counts().plot(kind='pie', 
                                            ax=ax,             # draw on the first subplot
                                            figsize=(12, 10),
                                            autopct='%1.1f%%', # add in percentages
                                            startangle=90,     # start the first wedge at 90°
                                            shadow=True,       # add shadow  
                                            explode=[0, 0, 0.1, 0.1] # pull out the last two wedges
                                            )
plt.title('Job satisfaction at 4 levels. 1 is not satisfied and 4 is very satisfied')


ax = fig.add_subplot(1,2,2)
data['class'].value_counts().plot(kind='pie', 
                                            ax=ax,             # draw on the second subplot
                                            figsize=(12, 10),
                                            autopct='%1.1f%%', # add in percentages
                                            startangle=90,     # start the first wedge at 90°
                                            shadow=True,       # add shadow   
                                            explode=[0, 0.1]   # pull out the last wedge
                                            )
plt.title('Job satisfaction at 2 levels')


### Distribution of numerical attributes. Let's use a histogram

To better understand numerical data we can use histograms. Let's see an example:

In [None]:
fig = plt.figure(figsize=(10, 5))
plt.title('Age')
data['Age'].plot(kind='hist', rwidth=1)
plt.show()
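
A histogram is just a count of values per bin. As a sketch on invented ages (not the dataset), `np.histogram` returns the counts and bin edges that a histogram plot would draw.

```python
import numpy as np

# Invented ages, not taken from the dataset
ages = np.array([22, 25, 31, 38, 41, 45, 52, 58, 60])

# Split the range [22, 60] into 4 equal-width bins and count values per bin
counts, edges = np.histogram(ages, bins=4)

print(edges)   # 5 edges delimit 4 bins
print(counts)  # number of ages falling in each bin
```

Changing `bins` changes how coarse the picture is, which is worth trying before drawing conclusions about the shape of a distribution.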

**Assignment 05**
Plot a histogram for each numeric column in our dataset

In [None]:
# create a Figure object to hold the subplots
fig = plt.figure(figsize=(20, 20))

x = 1
# Draw one histogram per numeric variable
for numAtt in numericColumns:
    ax = fig.add_subplot(6, 3, x)
    # START CODE HERE
    plt.title(None)   # Add the column name as title
    None
    x+=1
# END CODE HERE
    
plt.show()


### Distribution of categorical and ordinal attributes. 

Let's use pie charts for both. Note that the wedges of each chart are ordered by the frequency of each class.

In [None]:
fig = plt.figure(figsize=(30, 60))
x = 1
for catAtt in factorColumns:
    ax = fig.add_subplot(6,3,x)
    data[catAtt].value_counts().plot(kind='pie', ax=ax, startangle=115, fontsize=12)
    x = x + 1
    
plt.show()


The same works for ordinal attributes. However, we would like to keep the natural order of the classes in this case. Can you find a solution for this?

In [None]:
fig = plt.figure(figsize=(30, 60))
x = 1
for catAtt in levelColumns:
    ax = fig.add_subplot(6,3,x)
    data[catAtt].value_counts().plot(kind='pie', ax=ax, startangle=115, fontsize=12)
    x = x + 1
plt.show()

### Now, let's use bar plots

For ordinal attributes it is better to keep the natural order of the variable, but for non-ordinal attributes it is better to order the bars by value count.

In [None]:
# Example of a single bar plot
data['Education'].value_counts().plot(kind='bar', figsize=(5, 5), rot=90).set_title('Education')

**Exercise** Create bar plots for all the factorColumns

In [None]:
fig = plt.figure(figsize=(30, 100))
x = 1
# START CODE HERE
# Loop over the factorColumns
for None in None:
    ax = fig.add_subplot(6,7,x)
    # Add the subplot
    None
    # Increment x
    x = None
# END CODE HERE

plt.show()

Now you can create a bar plot for an ordinal variable. In this case, note the `sort=False` parameter passed to `value_counts`. It keeps the natural order of the classes in the plot.

In [None]:
# Example of a single bar plot
data['JobInvolvement'].value_counts(sort=False).plot(kind='bar', figsize=(5, 5), rot=90).set_title('JobInvolvement')
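
The effect of `sort=False` can be checked without plotting. The sketch below uses an invented ordered categorical with the levels low, medium and high, which are not part of the dataset.

```python
import pandas as pd

# Invented ordinal answers with a natural order low < medium < high
levels = pd.Categorical(
    ["high", "low", "high", "medium", "high", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)
answers = pd.Series(levels)

by_frequency = answers.value_counts()            # most frequent class first
by_category = answers.value_counts(sort=False)   # order of the categories

print(by_frequency)
print(by_category)
```

This only gives the natural order when the column is already a categorical whose categories were declared in that order.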

**Exercise** Create bar plots for all the levelColumns

In [None]:
fig = plt.figure(figsize=(30, 60))
x = 1
# START CODE HERE
# Loop over the levelColumns
for None in None:
    ax = fig.add_subplot(6, 6, x)
    # Add the subplot
    None
    # Increment x
    x = None
# END CODE HERE
plt.show()

## Evaluate whether some variables have classification power

### Violin plots

Let's use seaborn to make some fancy charts for numeric vs categorical data. Violin plots are a powerful tool for this purpose.

See: https://seaborn.pydata.org/generated/seaborn.violinplot.html

In [None]:
fig = plt.figure(figsize = (20, 20))
x = 1
for numAtt in numericColumns:
    ax = fig.add_subplot(5, 3 , x)
    # The figure size is set above; violinplot takes no figsize, so draw on the given axes
    sns.violinplot(x = "JobSatisfaction", y = numAtt, data = data, ax = ax)
    x = x + 1

### Box plots

Sometimes box plots show the relationship between numeric and categorical attributes more clearly.

See: https://seaborn.pydata.org/tutorial/categorical.html

In [None]:
f, axes = plt.subplots(nrows = 5, ncols = 3, figsize = (30,30))
axes = axes.flatten()
x = 0
for numAtt in numericColumns:
    # ax = fig.add_subplot(6,7,x)
    sns.boxplot(x = "JobSatisfaction", y = numAtt, data = data, ax=axes[x])
    x = x + 1
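
The box in each box plot is drawn from the quartiles of the numeric attribute within each class. As a rough numeric counterpart, sketched on invented data with hypothetical columns `satisfaction` and `income`, the same quartiles can be tabulated with `groupby`.

```python
import pandas as pd

# Invented data: two satisfaction levels with clearly different incomes
toy = pd.DataFrame({
    "satisfaction": [1, 1, 1, 2, 2, 2],
    "income": [10, 20, 30, 40, 50, 60],
})

# Quartiles of income within each satisfaction level, one row per level
quartiles = (
    toy.groupby("satisfaction")["income"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

When the rows of such a table look alike across classes, the attribute probably has little power to separate them.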

### Do you see something???

## Categorical vs Categorical

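Now let's try to determine whether some of the categorical attributes have classification power. Let's try a stacked bar plot of the row-normalized cross-tabulation against the target. A count plot is left commented out as an alternative.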

In [None]:
f, axes = plt.subplots(nrows = 3, ncols = 3, figsize = (20,20))
axes = axes.flatten()
x = 0
for catAtt in factorColumns:
    cross = pd.crosstab(index=data[catAtt], 
                        columns=data["JobSatisfaction"],
                        normalize='index')
    cross.plot(kind="bar", 
                 stacked=True,
                 ax=axes[x])

    #sns.countplot(y = catAtt, hue="JobSatisfaction", data=data, ax=axes[x]);
    x = x + 1
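
What `normalize='index'` does is easier to see on a small table. The sketch below uses invented data with hypothetical columns `department` and `satisfaction`. Each row of the result holds the share of every satisfaction level within one department.

```python
import pandas as pd

# Invented data, not taken from the dataset
toy = pd.DataFrame({
    "department": ["Sales", "Sales", "HR", "HR", "HR", "Sales"],
    "satisfaction": [1, 1, 2, 2, 2, 2],
})

# Share of each satisfaction level within each department (rows sum to 1)
shares = pd.crosstab(index=toy["department"],
                     columns=toy["satisfaction"],
                     normalize="index")
print(shares)
```

Rows that differ a lot from each other, as they do here, are the pattern to look for in the stacked bars.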