# Data Exploration

## Plan

* We'll start with exploring categorical values and applying some standard data cleaning steps.
    * Remove spaces.
    * Convert to lower case.
    * Unicode normalization.
    * Handling missing/unknown categories.
* We'll create `scikit-learn` pipelines that we can reuse during training.
* We'll do the same for numerical data as well.
* At the end of this notebook we'll have a list of the data preparation steps needed to train the model.
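
As a rough illustration of the cleaning steps listed above, the sketch below folds them into one function applied value by value. The name `clean_category`, the `"unknown"` placeholder, and the sample values are assumptions made for this example, not the final pipeline code.

```python
import unicodedata

import pandas as pd


def clean_category(value):
    """Normalize one categorical value (hypothetical helper for illustration)."""
    # Missing or non-string values fall into a single "unknown" bucket.
    if not isinstance(value, str):
        return "unknown"
    # Unicode normalization: decompose accented characters, then drop the accents.
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Remove surrounding spaces and convert to lower case.
    text = text.strip().lower()
    # Empty strings are treated like missing values.
    return text or "unknown"


raw = pd.Series(["  Kolkāta ", "DELHI", None, ""])
cleaned = raw.map(clean_category)
print(cleaned.tolist())
```

A function like this could later be wrapped in a `FunctionTransformer` so the same logic runs at training and prediction time.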

## Import Libraries

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path

## Read Training Data

In [13]:
## root directory for all data files
data_dir = Path("..", "data")

In [14]:
X_train = pd.read_csv(Path(data_dir,"X_train.csv"))
y_train = pd.read_csv(Path(data_dir,"y_train.csv"))

In [15]:
X_train.shape,y_train.shape

((22320, 16), (22320, 1))

## Exploring Categorical Data

In [16]:
## let's list the categorical columns
X_train.select_dtypes(include=["object"]).dtypes

gender               object
city                 object
profession           object
sleep_duration       object
dietary_habits       object
degree               object
suicidal_thoughts    object
family_history       object
dtype: object

In [17]:
## let's look at the data to confirm these columns are correctly typed as object
X_train.select_dtypes(include=["object"]).head(5)

Unnamed: 0,gender,city,profession,sleep_duration,dietary_habits,degree,suicidal_thoughts,family_history
0,Male,Jaipur,Student,'7-8 hours',Moderate,'Class 12',Yes,No
1,Male,Vadodara,Student,'7-8 hours',Moderate,B.Arch,No,Yes
2,Male,Ahmedabad,Student,'7-8 hours',Unhealthy,M.Ed,Yes,Yes
3,Male,Bhopal,Student,'7-8 hours',Moderate,B.Com,Yes,No
4,Male,Patna,Student,'5-6 hours',Unhealthy,B.Com,No,No


In [18]:
## creating column list for easier access
category_columns = X_train.select_dtypes(include=["object"]).dtypes.index.tolist()
category_columns

['gender',
 'city',
 'profession',
 'sleep_duration',
 'dietary_habits',
 'degree',
 'suicidal_thoughts',
 'family_history']

### Handling Missing Values

In [19]:
## lets check for missing values
X_train.select_dtypes(include=["object"]).isnull().sum()

gender               0
city                 0
profession           0
sleep_duration       0
dietary_habits       0
degree               0
suicidal_thoughts    0
family_history       0
dtype: int64

Luckily there are no missing values, but our training pipeline should still include a step that fills missing values with "unknown" in case the production or test data has any.

In [20]:
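## TODO: add this to the pipeline
## calling fillna(inplace=True) on the select_dtypes() result only changes a temporary copy,
## so assign the filled columns back to X_train instead
X_train[category_columns] = X_train[category_columns].fillna("unknown")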

Let's create a ColumnTransformer to transform the data for easier exploration.

In [46]:
from sklearn.preprocessing import FunctionTransformer
## creating functional transformers

def to_lower_case(df, columns=None):
    print(columns)
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
    if columns is None:
        columns = df_copy.select_dtypes(include=["object"]).columns
    for col in columns:
        df_copy[col] = df_copy[col].str.lower()
    return df_copy

case_transformer = FunctionTransformer(to_lower_case)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
# ## Creating column transformer

ct = ColumnTransformer([(
    "case_transformer", case_transformer,category_columns
)])

ct.set_output(transform="pandas")

ct.fit_transform(X=X_train)

# cat_pipeline = make_pipeline(case_transformer)

# temp = cat_pipeline.fit_transform(X_train,y_train)

None


Unnamed: 0,case_transformer__gender,case_transformer__city,case_transformer__profession,case_transformer__sleep_duration,case_transformer__dietary_habits,case_transformer__degree,case_transformer__suicidal_thoughts,case_transformer__family_history
0,male,jaipur,student,'7-8 hours',moderate,'class 12',yes,no
1,male,vadodara,student,'7-8 hours',moderate,b.arch,no,yes
2,male,ahmedabad,student,'7-8 hours',unhealthy,m.ed,yes,yes
3,male,bhopal,student,'7-8 hours',moderate,b.com,yes,no
4,male,patna,student,'5-6 hours',unhealthy,b.com,no,no
...,...,...,...,...,...,...,...,...
22315,male,kolkata,student,'7-8 hours',unhealthy,b.com,yes,no
22316,female,patna,student,'less than 5 hours',unhealthy,msc,yes,yes
22317,male,lucknow,student,'7-8 hours',healthy,b.arch,yes,yes
22318,female,kolkata,student,'5-6 hours',unhealthy,md,yes,no


In [45]:
X_train.head()

Unnamed: 0,gender,age,city,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,suicidal_thoughts,work_study_hours,financial_stress,family_history
0,Male,18.0,Jaipur,Student,4.0,0.0,6.02,1.0,0.0,'7-8 hours',Moderate,'Class 12',Yes,3.0,5.0,No
1,Male,25.0,Vadodara,Student,3.0,0.0,6.37,2.0,0.0,'7-8 hours',Moderate,B.Arch,No,9.0,1.0,Yes
2,Male,30.0,Ahmedabad,Student,3.0,0.0,9.24,2.0,0.0,'7-8 hours',Unhealthy,M.Ed,Yes,5.0,5.0,Yes
3,Male,34.0,Bhopal,Student,3.0,0.0,7.37,5.0,0.0,'7-8 hours',Moderate,B.Com,Yes,12.0,3.0,No
4,Male,25.0,Patna,Student,3.0,0.0,7.47,4.0,0.0,'5-6 hours',Unhealthy,B.Com,No,11.0,5.0,No


### Handling Missing Values

In [None]:
X_train.isna().sum()

gender                0
age                   0
city                  0
profession            0
academic_pressure     0
work_pressure         0
cgpa                  0
study_satisfaction    0
job_satisfaction      0
sleep_duration        0
dietary_habits        0
degree                0
suicidal_thoughts     0
work_study_hours      0
financial_stress      2
family_history        0
dtype: int64

In [None]:
X_train.isnull().sum()

gender                0
age                   0
city                  0
profession            0
academic_pressure     0
work_pressure         0
cgpa                  0
study_satisfaction    0
job_satisfaction      0
sleep_duration        0
dietary_habits        0
degree                0
suicidal_thoughts     0
work_study_hours      0
financial_stress      2
family_history        0
dtype: int64

* So far only "financial_stress" has missing values (2 NaNs). We can fix this with a SimpleImputer or a KNNImputer.

### Exploring Categorical Values

In [None]:
## let's explore the "gender" column
X_train["gender"].value_counts()

gender
Male      12437
Female     9883
Name: count, dtype: int64

* Since the dataset has only 2 genders, we'll map them to 1 for "Male" and 0 for "Female".
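
The mapping could look roughly like the sketch below. The `gender_map` name and the extra "Other" value are assumptions added to show what happens to a category the mapping does not know about.

```python
import pandas as pd

# Toy series; "Other" is included only to show the behaviour for unseen values.
gender = pd.Series(["Male", "Female", "Male", "Other"], name="gender")

# Map on the cleaned (stripped, lower-cased) value so "male " and "Male" agree.
gender_map = {"male": 1, "female": 0}
encoded = gender.str.strip().str.lower().map(gender_map)

# Values missing from gender_map become NaN and would need imputing downstream.
print(encoded.tolist())
```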

In [None]:
## let's explore the "city" column
X_train["city"].value_counts()

city
Kalyan                  1284
Srinagar                1073
Hyderabad               1063
Vasai-Virar             1042
Lucknow                  943
Thane                    910
Kolkata                  890
Agra                     864
Ludhiana                 848
Surat                    842
Jaipur                   840
Patna                    823
Visakhapatnam            763
Pune                     751
Bhopal                   748
Ahmedabad                748
Chennai                  707
Meerut                   660
Rajkot                   633
Bangalore                625
Delhi                    602
Ghaziabad                588
Mumbai                   563
Vadodara                 561
Varanasi                 550
Nagpur                   533
Indore                   519
Kanpur                   493
Nashik                   452
Faridabad                381
Harsha                     2
Bhavna                     2
Saanvi                     2
City                       2
Khaziabad

* The city column needs some cleaning: some values are clearly not cities but person names, degree names, or fragments that appear to belong to other columns.
* We also have access to a separate dataset that maps major Indian cities to their latitude and longitude. We can use it to validate the city names and attach coordinates to them.
* The plan is then to use latitude and longitude to create new features based on similarity.
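
The validation and coordinate lookup described above could be done with a single left merge, sketched here on toy frames. The frame names `people` and `lookup`, the helper column `city_key`, and the choice of a merge are assumptions for illustration; the default coordinates are the Nagpur values this notebook uses as a fallback.

```python
import pandas as pd

# Fallback coordinates (the Nagpur defaults used in this notebook).
DEFAULT_LAT, DEFAULT_LONG = 21.122615, 79.041124

# Toy stand-ins for the training data and the city lookup table.
people = pd.DataFrame({"city": ["Delhi", " mumbai ", "Harsha"]})
lookup = pd.DataFrame({
    "city": ["Delhi", "Mumbai"],
    "lat": [28.61, 19.0761],
    "lng": [77.23, 72.8775],
})

# Compare on a cleaned key so casing and stray spaces do not break the match.
people["city_key"] = people["city"].str.strip().str.lower()
lookup["city_key"] = lookup["city"].str.strip().str.lower()

merged = people.merge(
    lookup[["city_key", "lat", "lng"]], on="city_key", how="left", indicator=True
)

# Rows found in the lookup are valid cities; the rest get the default coordinates.
merged["is_city"] = (merged["_merge"] == "both").astype(int)
merged["lat"] = merged["lat"].fillna(DEFAULT_LAT)
merged["long"] = merged["lng"].fillna(DEFAULT_LONG)
merged = merged.drop(columns=["lng", "_merge", "city_key"])
print(merged)
```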

In [None]:
## let's add a temporary flag column is_city, which will be set to 1 if the value is a valid city.
## we'll also use this on the test data in case its cities are invalid.
X_train["is_city"] = 0

In [None]:
## we'll also add a default lat/long to all cities.
## for now let's default it to the center of India
## after searching I learnt that `Nagpur` is the geographical center of India, so we'll set the default lat/long to that

X_train["lat"] = 21.122615
X_train["long"] = 79.041124

In [None]:
X_train.head()

Unnamed: 0,gender,age,city,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,suicidal_thoughts,work_study_hours,financial_stress,family_history,is_city,lat,long
1657,Male,18.0,Jaipur,Student,4.0,0.0,6.02,1.0,0.0,'7-8 hours',Moderate,'Class 12',Yes,3.0,5.0,No,0,21.122615,79.041124
24995,Male,25.0,Vadodara,Student,3.0,0.0,6.37,2.0,0.0,'7-8 hours',Moderate,B.Arch,No,9.0,1.0,Yes,0,21.122615,79.041124
27613,Male,30.0,Ahmedabad,Student,3.0,0.0,9.24,2.0,0.0,'7-8 hours',Unhealthy,M.Ed,Yes,5.0,5.0,Yes,0,21.122615,79.041124
13512,Male,34.0,Bhopal,Student,3.0,0.0,7.37,5.0,0.0,'7-8 hours',Moderate,B.Com,Yes,12.0,3.0,No,0,21.122615,79.041124
27029,Male,25.0,Patna,Student,3.0,0.0,7.47,4.0,0.0,'5-6 hours',Unhealthy,B.Com,No,11.0,5.0,No,0,21.122615,79.041124


* Let's read the city list and try to clean up the city column.

In [None]:
## read master city list
city_list = pd.read_csv(Path(data_dir,"in.csv"))
city_list.head()

Unnamed: 0,city,lat,lng,country,iso2,admin_name,capital,population,population_proper
0,Delhi,28.61,77.23,India,IN,Delhi,admin,32226000,16753235
1,Mumbai,19.0761,72.8775,India,IN,Mahārāshtra,admin,24973000,12478447
2,Kolkāta,22.5675,88.37,India,IN,West Bengal,admin,18502000,4496694
3,Bangalore,12.9789,77.5917,India,IN,Karnātaka,admin,15386000,8443675
4,Chennai,13.0825,80.275,India,IN,Tamil Nādu,admin,12395000,6727000


In [None]:
city_list.shape

(162, 9)

* We only have 162 cities, which suggests this list covers just the major cities of India. Let's see if it's enough to clean up our dataset.

In [None]:
## lower and strip the city list for accurate comparison
cities = city_list["city"].str.strip().str.lower().tolist()
cities 

['delhi',
 'mumbai',
 'kolkāta',
 'bangalore',
 'chennai',
 'hyderābād',
 'pune',
 'ahmedabad',
 'sūrat',
 'lucknow',
 'jaipur',
 'kanpur',
 'mirzāpur',
 'nāgpur',
 'ghāziābād',
 'supaul',
 'vadodara',
 'rājkot',
 'vishākhapatnam',
 'indore',
 'thāne',
 'bhopāl',
 'pimpri-chinchwad',
 'patna',
 'bilāspur',
 'ludhiāna',
 'āgra',
 'madurai',
 'jamshedpur',
 'prayagraj',
 'nāsik',
 'farīdābād',
 'meerut',
 'jabalpur',
 'kalyān',
 'vasai-virar',
 'najafgarh',
 'vārānasi',
 'srīnagar',
 'aurangābād',
 'dhanbād',
 'amritsar',
 'alīgarh',
 'guwāhāti',
 'hāora',
 'rānchi',
 'gwalior',
 'chandīgarh',
 'haldwāni',
 'vijayavāda',
 'jodhpur',
 'raipur',
 'kota',
 'bhayandar',
 'loni',
 'ambattūr',
 'salt lake city',
 'bhātpāra',
 'kūkatpalli',
 'dāsarhalli',
 'muzaffarpur',
 'oulgaret',
 'new delhi',
 'tiruvottiyūr',
 'puducherry',
 'byatarayanpur',
 'pallāvaram',
 'secunderābād',
 'shimla',
 'puri',
 'murtazābād',
 'shrīrāmpur',
 'chandannagar',
 'sultānpur mazra',
 'krishnanagar',
 'bārākpur',
 

In [None]:
## flag rows whose cleaned city name appears in the master list
## (vectorized membership test instead of a slow row-wise apply)
X_train["is_city"] = X_train["city"].str.strip().str.lower().isin(cities).astype(int)
    

* Let's check the number of valid vs invalid cities.

In [None]:
X_train["is_city"].value_counts()

is_city
0    12443
1     9877
Name: count, dtype: int64

In [None]:
X_train.loc[X_train["is_city"] == 0,"city"].value_counts()

city
Kalyan                  1284
Srinagar                1073
Hyderabad               1063
Thane                    910
Kolkata                  890
Agra                     864
Ludhiana                 848
Surat                    842
Visakhapatnam            763
Bhopal                   748
Rajkot                   633
Ghaziabad                588
Varanasi                 550
Nagpur                   533
Nashik                   452
Faridabad                381
Harsha                     2
Bhavna                     2
Saanvi                     2
City                       2
Khaziabad                  1
M.Com                      1
3.0                        1
Harsh                      1
Mihir                      1
M.Tech                     1
Gaurav                     1
Nalyan                     1
Nandini                    1
'Less than 5 Kalyan'       1
Reyansh                    1
'Less Delhi'               1
Mira                       1
Name: count, dtype: int64

In [None]:
city_list["city"].head(5)

0        Delhi
1       Mumbai
2      Kolkāta
3    Bangalore
4      Chennai
Name: city, dtype: object

* Visually, it seems the master city list uses "ā" instead of "a", which is why a lot of cities are not matching.
* Let's fix that and then try to update the dataset again.

In [None]:
import unicodedata

city_list["city"] = city_list["city"].map(lambda ct: unicodedata.normalize("NFKD",ct).encode("ascii","ignore").decode())

In [None]:
city_list["city"].head(5)

0        Delhi
1       Mumbai
2      Kolkata
3    Bangalore
4      Chennai
Name: city, dtype: object

* Looks like this fixed it. Let's apply the same normalization to our dataset before we set the is_city flag.

In [None]:
X_train["city"] = X_train["city"].map(lambda ct: unicodedata.normalize("NFKD",ct).encode("ascii","ignore").decode())

In [None]:
## lower and strip the city list for accurate comparison
cities = city_list["city"].str.strip().str.lower().tolist()

## vectorized membership test instead of a slow row-wise apply
X_train["is_city"] = X_train["city"].str.strip().str.lower().isin(cities).astype(int)

In [None]:
X_train["is_city"].value_counts()

is_city
1    21084
0     1236
Name: count, dtype: int64

This is much better: we now have only 1236 invalid values. Let's take a look at them.

In [None]:
X_train.loc[X_train["is_city"] == 0, "city"].value_counts()

city
Visakhapatnam           763
Nashik                  452
Harsha                    2
Bhavna                    2
Saanvi                    2
City                      2
Khaziabad                 1
M.Com                     1
3.0                       1
Harsh                     1
Mihir                     1
Nalyan                    1
Gaurav                    1
Nandini                   1
'Less than 5 Kalyan'      1
Reyansh                   1
'Less Delhi'              1
M.Tech                    1
Mira                      1
Name: count, dtype: int64

* Apart from Visakhapatnam, Nashik and Khaziabad, the rest seem to be wrong values.
* It's surprising that these cities do not match, since they are among the larger cities in India. The lower-cased master list printed earlier does contain `vishākhapatnam` and `nāsik`, so at least part of the mismatch comes from spelling variants rather than missing entries.
* To future-proof this, we'll explore another dataset with a more detailed list of cities that we can rely on.

In [None]:
## read master city list
master_city_list = pd.read_csv(Path(data_dir,"detailed_in.csv"))
master_city_list.head()

Unnamed: 0,name,ascii_name,lat,long
0,Rāvi River,Ravi River,30.62123,71.82683
1,Punjab Plains,Punjab Plains,30.0,75.0
2,Jhelum River,Jhelum River,31.16853,72.15066
3,Hindustan,Hindustan,28.0,76.0
4,Basantar River,Basantar River,32.47452,75.01449


In [None]:
## extract cities and prepare for comparison
master_cities = master_city_list["ascii_name"].str.strip().str.lower().to_list()
master_cities[:5]

['ravi river', 'punjab plains', 'jhelum river', 'hindustan', 'basantar river']

In [None]:
## let's reset is_city to 0
X_train["is_city"] = 0

## vectorized membership test against the detailed list (instead of a row-wise apply)
X_train["is_city"] = X_train["city"].str.strip().str.lower().isin(master_cities).astype(int)

In [None]:
X_train["is_city"].value_counts()

is_city
1    20634
0     1686
Name: count, dtype: int64

In [None]:
X_train.loc[X_train["is_city"] == 0, "city"].value_counts()

city
Vasai-Virar             1042
Bangalore                625
Harsha                     2
Bhavna                     2
Saanvi                     2
City                       2
Khaziabad                  1
3.0                        1
Harsh                      1
Mihir                      1
Nalyan                     1
Gaurav                     1
'Less than 5 Kalyan'       1
Reyansh                    1
'Less Delhi'               1
M.Tech                     1
M.Com                      1
Name: count, dtype: int64

* Our detailed master city list has Vasai and Virar as separate cities, and Bangalore is divided into separate regions such as Bangalore Urban and Bangalore Rural.
* Let's try fuzzy matching and see if it gives us better results.
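
One way to try fuzzy matching without extra dependencies is `difflib.get_close_matches` from the standard library. The sketch below is an assumption about how this could be done, not the approach the notebook settles on; `closest_city`, the short `known_cities` list, and the `0.8` cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Small stand-in for the cleaned master list (hypothetical subset).
known_cities = ["ghaziabad", "kalyan", "delhi", "bangalore urban", "vasai", "virar"]


def closest_city(name, candidates, cutoff=0.8):
    """Return the best fuzzy match for name, or None if nothing is close enough."""
    matches = get_close_matches(name.strip().lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in ["Khaziabad", "Nalyan", "Harsha"]:
    print(raw, "->", closest_city(raw, known_cities))
```

The cutoff is a trade-off: a high value keeps person names such as "Harsha" from being mapped to a city, but it may also reject legitimate variants, so it would need tuning against the real list.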

In [None]:
master_cities