# Lab | Revisiting Machine Learning Case Study

- In this lab, you will use the `learningSet.csv` file, which you already cloned in today's activities.

### Instructions

Complete the following steps on the **categorical columns** in the dataset:

- Check for null values in all the columns
- Exclude the following variables by looking at the definitions. Create a new empty list called `drop_list`. We will append to this list and then drop all the columns in it later:
    - `OSOURCE` - symbol definitions not provided, too many categories
    - `ZIP` (zip code) - we are including state already
- Identify columns that have over 85% missing values
- Remove those columns from the dataframe
- Reduce the number of categories in the column `GENDER`. The column should contain only "M" for males, "F" for females, and "other" for everything else
    - Note that there are a few null values in the column. We will first replace those null values using the code below:

    ```python
    print(categorical['GENDER'].value_counts())
    categorical['GENDER'] = categorical['GENDER'].fillna('F')
    ```


In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('learningSet.csv')
display(data.head(), data.shape)


Unnamed: 0,ODATEDW,OSOURCE,TCODE,STATE,ZIP,MAILCODE,PVASTATE,DOB,NOEXCH,RECINHSE,...,TARGET_D,HPHONE_D,RFA_2R,RFA_2F,RFA_2A,MDMAUD_R,MDMAUD_F,MDMAUD_A,CLUSTER2,GEOCODE2
0,8901,GRI,0,IL,61081,,,3712,0,,...,0.0,0,L,4,E,X,X,X,39.0,C
1,9401,BOA,1,CA,91326,,,5202,0,,...,0.0,0,L,2,G,X,X,X,1.0,A
2,9001,AMH,1,NC,27017,,,0,0,,...,0.0,1,L,4,E,X,X,X,60.0,C
3,8701,BRY,0,CA,95953,,,2801,0,,...,0.0,1,L,4,E,X,X,X,41.0,C
4,8601,,0,FL,33176,,,2001,0,X,...,0.0,1,L,2,F,X,X,X,26.0,A


(95412, 481)

#### Standard column names

In [3]:
#cols = []

#for c in data.columns:
#    cols.append(c.lower().replace(" ", "_"))

#data.columns = cols
#data.columns

#### Splitting numerical-categorical columns

In [4]:
data.dtypes

ODATEDW       int64
OSOURCE      object
TCODE         int64
STATE        object
ZIP          object
             ...   
MDMAUD_R     object
MDMAUD_F     object
MDMAUD_A     object
CLUSTER2    float64
GEOCODE2     object
Length: 481, dtype: object

In [5]:
cat_data = data.select_dtypes(include = 'object')

display(cat_data.head(), cat_data.shape)

Unnamed: 0,OSOURCE,STATE,ZIP,MAILCODE,PVASTATE,NOEXCH,RECINHSE,RECP3,RECPGVG,RECSWEEP,...,RFA_21,RFA_22,RFA_23,RFA_24,RFA_2R,RFA_2A,MDMAUD_R,MDMAUD_F,MDMAUD_A,GEOCODE2
0,GRI,IL,61081,,,0,,,,,...,S4E,S4E,S4E,S4E,L,E,X,X,X,C
1,BOA,CA,91326,,,0,,,,,...,N1E,N1E,,F1E,L,G,X,X,X,A
2,AMH,NC,27017,,,0,,,,,...,,S4D,S4D,S3D,L,E,X,X,X,C
3,BRY,CA,95953,,,0,,,,,...,A1D,A1D,,,L,E,X,X,X,C
4,,FL,33176,,,0,X,X,,,...,A3D,I4E,A3D,A3D,L,F,X,X,X,A


(95412, 74)
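
The cell above keeps only the `object` columns. As a minimal sketch of splitting both ways, assuming a small made-up frame (`toy`, `num_toy` and `cat_toy` are hypothetical names, not part of the lab), `select_dtypes` can take `np.number` for the numerical side and exclude it for the rest:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for `data`
toy = pd.DataFrame({
    "TCODE": [0, 1],
    "STATE": ["IL", "CA"],
    "CLUSTER2": [39.0, 1.0],
})

# Numerical columns: every integer and float dtype
num_toy = toy.select_dtypes(include=np.number)

# Everything that is not numerical is treated as categorical here
cat_toy = toy.select_dtypes(exclude=np.number)

print(num_toy.columns.tolist(), cat_toy.columns.tolist())
```

Excluding `np.number` instead of including `'object'` is a design choice for this sketch: it does not depend on which dtype pandas uses to store strings.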

#### Check for null values in all the columns

In [6]:
cat_data.isna().sum()

OSOURCE       0
STATE         0
ZIP           0
MAILCODE      0
PVASTATE      0
           ... 
RFA_2A        0
MDMAUD_R      0
MDMAUD_F      0
MDMAUD_A      0
GEOCODE2    132
Length: 74, dtype: int64

In [7]:
# Keeping only the columns that contain NaN values
nan_counts = cat_data.isna().sum()
nan_counts = nan_counts[nan_counts > 0]

nan_counts

GEOCODE2    132
dtype: int64
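
Only `GEOCODE2` shows up here, although columns such as `MAILCODE` look empty in the preview. One possible explanation is that blanks are stored as whitespace strings, which `isna()` does not flag. The sketch below uses a made-up frame (`toy`, hypothetical values) to show how such blanks could be converted to NaN before counting. Whether this is appropriate for every column in the real dataset is an assumption to check against the data dictionary.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; " " mimics a blank field
toy = pd.DataFrame({
    "GENDER": ["F", " ", "M", "U"],
    "MAILCODE": [" ", " ", "B", " "],
})

# isna() does not treat whitespace-only strings as missing
print(toy.isna().sum())

# Turn empty or whitespace-only strings into NaN, then count again
toy_clean = toy.replace(r"^\s*$", np.nan, regex=True)
blank_counts = toy_clean.isna().sum()
print(blank_counts)
```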

#### Exclude the following variables by looking at the definitions.

Create a new empty list called `drop_list`. We will append to this list and then drop all the columns in it later:

- `OSOURCE` - symbol definitions not provided, too many categories
- `ZIP` (zip code) - we are including state already

In [8]:
# Finding columns with 'zip' in their name (case-insensitive: compare lowercase to lowercase)
zip_cols = [col for col in data.columns if 'zip' in col.lower()]
print(zip_cols)
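
The instruction asks for a `drop_list` that is filled now and applied later. A minimal sketch of that pattern follows, using a made-up frame (`toy`, hypothetical values) and assuming the zip code column is named `ZIP`, as in the dataset preview:

```python
import pandas as pd

# Toy frame with made-up values, containing the two columns to exclude
toy = pd.DataFrame({
    "OSOURCE": ["GRI", "BOA"],
    "STATE": ["IL", "CA"],
    "ZIP": ["61081", "91326"],
})

drop_list = []
drop_list.append("OSOURCE")  # symbol definitions not provided, too many categories
drop_list.append("ZIP")      # state is already included

# Drop everything collected so far in one step;
# errors="ignore" skips names that are not present in the frame
toy = toy.drop(columns=drop_list, errors="ignore")
print(toy.columns.tolist())
```

On the lab data, the same `drop` call would be applied to the categorical dataframe once `drop_list` is complete.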


#### Identify columns that have over 85% missing values (categorical columns only?)

In [9]:
# Columns that contain at least one missing value
cat_data.columns[cat_data.isna().any()].tolist()

# The .tolist() method is used to convert an array or a pandas Series into a Python list.

['GEOCODE2']

In [10]:
# Calculating percentages
missing_percent = cat_data.isna().mean() * 100
missing_percent

OSOURCE     0.000000
STATE       0.000000
ZIP         0.000000
MAILCODE    0.000000
PVASTATE    0.000000
              ...   
RFA_2A      0.000000
MDMAUD_R    0.000000
MDMAUD_F    0.000000
MDMAUD_A    0.000000
GEOCODE2    0.138347
Length: 74, dtype: float64

In [11]:
# Selecting all columns with more than 85% missing values (none in cat_data, several in data)
missing_percent[missing_percent > 85]

Series([], dtype: float64)

#### Remove those columns from the dataframe
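
No categorical column crossed the 85% threshold above, so there is nothing to remove in this run. For reference, a minimal sketch of the removal step on a made-up frame (`toy`, `cols_to_drop` and `reduced` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and known missing rates
toy = pd.DataFrame({
    "A": ["x"] + [np.nan] * 9,        # 90% missing
    "B": ["y"] * 10,                  # 0% missing
    "C": ["z"] * 5 + [np.nan] * 5,    # 50% missing
})

# Percentage of missing values per column
missing_percent = toy.isna().mean() * 100

# Names of the columns above the 85% threshold
cols_to_drop = missing_percent[missing_percent > 85].index.tolist()

# Remove them; an empty list leaves the frame unchanged
reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop, reduced.columns.tolist())
```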

#### Reduce the number of categories in the column `GENDER`.
The column should contain only "M" for males, "F" for females, and "other" for everything else.


Note that there are a few null values in the column. We will first replace those null values using the code below:

```python
print(categorical['GENDER'].value_counts())
categorical['GENDER'] = categorical['GENDER'].fillna('F')
```

In [12]:
cat_data['GENDER'].value_counts()

F    51277
M    39094
      2957
U     1715
J      365
C        2
A        2
Name: GENDER, dtype: int64

In [13]:
# Filling null values as indicated (the blank entries are whitespace strings, not NaN, so the counts stay the same)
cat_data['GENDER'] = cat_data['GENDER'].fillna('F')

cat_data['GENDER'].value_counts()

F    51277
M    39094
      2957
U     1715
J      365
C        2
A        2
Name: GENDER, dtype: int64

In [14]:
# Reducing number of categories: everything except "M" and "F" becomes "other"
cat_data['GENDER'] = cat_data['GENDER'].replace(
    {' ': 'other', 'U': 'other', 'J': 'other', 'C': 'other', 'A': 'other', '': 'other'}
)

# print the value counts to verify the result
cat_data["GENDER"].value_counts()

F        51277
M        39094
other     5041
Name: GENDER, dtype: int64
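
The `replace` mapping above lists every unwanted code by hand. An alternative sketch, not the approach used in this notebook, keeps "M" and "F" and sends anything else to "other" without enumerating the codes. The series below is made up (`gender` and `reduced` are hypothetical names):

```python
import pandas as pd

# Toy series with made-up values, including a blank and a real null
gender = pd.Series(["F", "M", " ", "U", "J", None, "C", "A"], name="GENDER")

# Fill real nulls first, as the lab instructs
gender = gender.fillna("F")

# Keep values that are "M" or "F"; replace all others with "other"
reduced = gender.where(gender.isin(["M", "F"]), "other")
print(reduced.value_counts())
```

This also catches codes that were not seen when the mapping was written, which may or may not be desirable depending on how strictly unexpected values should be flagged.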