# Data Exploration

## Plan

* We'll start by preprocessing the categorical values and applying some standard data cleaning steps:
    * Remove leading and trailing spaces.
    * Convert to lower case.
    * Apply Unicode normalization.
    * Handle missing/unknown categories.
* We'll create `scikit-learn` pipelines that we can reuse during training.
* We'll do the same for numerical data as well.
* At the end of this notebook we'll have a list of the data preparation steps needed to train the model.
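
As a rough illustration of the four cleaning steps listed above, here is a minimal sketch that applies them to a single value using only the standard library. The helper name `clean_category` and the sample values are made up for this example; the notebook itself applies the same ideas column-wise with pandas.

```python
import math
import unicodedata


def clean_category(value, missing="unknown"):
    """Apply the cleaning steps to one categorical value (illustrative helper)."""
    # Treat None, NaN and empty/whitespace-only strings as missing.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return missing
    text = str(value).strip().lower()
    if not text:
        return missing
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding to ASCII with errors="ignore" then drops the marks.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()


print(clean_category("  Jaipur "))  # -> jaipur
print(clean_category(None))         # -> unknown
```

Note that the ASCII step silently drops characters that have no ASCII base letter, so it is best suited to Latin-script names.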

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path

## Read Training Data

In [2]:
## root directory for all data files
data_dir = Path("..", "data")

In [3]:
X_train = pd.read_csv(Path(data_dir,"X_train.csv"))
y_train = pd.read_csv(Path(data_dir,"y_train.csv"))

In [4]:
X_train.shape,y_train.shape

((22320, 16), (22320, 1))

## Preprocessing Categorical Data

In [5]:
## let's list the categorical columns
X_train.select_dtypes(include=["object"]).dtypes

gender               object
city                 object
profession           object
sleep_duration       object
dietary_habits       object
degree               object
suicidal_thoughts    object
family_history       object
dtype: object

In [6]:
## let's look at the data to make sure the columns are correctly typed as object
X_train.select_dtypes(include=["object"]).head(5)

Unnamed: 0,gender,city,profession,sleep_duration,dietary_habits,degree,suicidal_thoughts,family_history
0,Male,Jaipur,Student,'7-8 hours',Moderate,'Class 12',Yes,No
1,Male,Vadodara,Student,'7-8 hours',Moderate,B.Arch,No,Yes
2,Male,Ahmedabad,Student,'7-8 hours',Unhealthy,M.Ed,Yes,Yes
3,Male,Bhopal,Student,'7-8 hours',Moderate,B.Com,Yes,No
4,Male,Patna,Student,'5-6 hours',Unhealthy,B.Com,No,No


In [7]:
## creating column list for easier access
category_columns = X_train.select_dtypes(include=["object"]).dtypes.index.tolist()
category_columns

['gender',
 'city',
 'profession',
 'sleep_duration',
 'dietary_habits',
 'degree',
 'suicidal_thoughts',
 'family_history']

### Default Changes
* This section applies the default cleaning steps to categorical data:
    * strip leading and trailing spaces,
    * replace missing values with "unknown" (we could impute them with some kind of prediction algorithm instead, but "unknown" is good enough for now since the training data has no missing values),
    * convert the values to lower case,
    * apply Unicode normalization.

In [8]:
## let's check for missing values
X_train.select_dtypes(include=["object"]).isnull().sum()

gender               0
city                 0
profession           0
sleep_duration       0
dietary_habits       0
degree               0
suicidal_thoughts    0
family_history       0
dtype: int64

Luckily there are no missing values, but our training pipeline should still have a step that fills missing values with "unknown" in case production or test data has them.

In [9]:
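## TODO Add this to pipeline
## select_dtypes returns a new DataFrame, so an in-place fillna on it would not change X_train;
## assign the filled columns back instead
X_train[category_columns] = X_train[category_columns].fillna("unknown")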
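
One pitfall worth keeping in mind when filling missing categories: `select_dtypes` returns a new DataFrame, so calling `fillna(..., inplace=True)` on its result only fills that temporary copy. The sketch below uses a toy frame with made-up values (not the training data) to show the difference between filling the temporary and assigning the filled columns back.

```python
import pandas as pd

# Toy frame (hypothetical values) with one missing category.
df = pd.DataFrame({
    "city": pd.Series(["jaipur", None], dtype=object),
    "age": [21, 34],
})

# In-place fill on the frame returned by select_dtypes:
# the temporary is filled, df itself is not.
tmp = df.select_dtypes(include=["object"])
tmp.fillna("unknown", inplace=True)
missing_after_inplace = int(df["city"].isna().sum())

# Assigning the filled columns back does update df.
cat_cols = ["city"]
df[cat_cols] = df[cat_cols].fillna("unknown")
missing_after_assign = int(df["city"].isna().sum())

print(missing_after_inplace, missing_after_assign)
```

Inside a pipeline this concern goes away, because each transformer returns a new frame instead of mutating its input.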

Let's create pipelines to transform the data for easier exploration.

In [10]:
import unicodedata
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from pandas.api.types import is_string_dtype

## creating functional transformers

## fill missing values with "unknown"
def fill_empty_strings_fn(df, columns=None):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
    for col in df_copy.columns:
        ## TODO : Confirm if comparing with type object is correct. 
        if is_string_dtype(df_copy[col]):
            df_copy[col] = df_copy[col].fillna("unknown")
    return df_copy

## remove spaces
def strip_spaces_fn(df, columns=None):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
    for col in df_copy.columns:
        ## TODO : Confirm if comparing with type object is correct. 
        if is_string_dtype(df_copy[col]):
            df_copy[col] = df_copy[col].str.strip()
    return df_copy



def to_lower_case_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
    for col in df_copy.columns:
        ## TODO : Confirm if comparing with type object is correct. 
        if is_string_dtype(df_copy[col]):
            df_copy[col] = df_copy[col].str.lower()
    return df_copy

def normalize_unicode_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
    for col in df_copy.columns:
        ## TODO : Confirm if comparing with type object is correct. 
        if is_string_dtype(df_copy[col]):
            df_copy[col] = df_copy[col].map(lambda ct: unicodedata.normalize("NFKD",ct).encode("ascii","ignore").decode())
    return df_copy


fill_empty_strings = FunctionTransformer(fill_empty_strings_fn,feature_names_out="one-to-one")
strip_spaces = FunctionTransformer(strip_spaces_fn,feature_names_out="one-to-one")
to_lower_case = FunctionTransformer(to_lower_case_fn,feature_names_out="one-to-one")
normalize_unicode = FunctionTransformer(normalize_unicode_fn, feature_names_out="one-to-one")

In [11]:
## in our use case pipeline would make more sense as we need to use output of one transformer in another. 
default_cat_pipeline = Pipeline([
    ("fill_empty_strings", fill_empty_strings),
    ("strip_spaces", strip_spaces),
    ("to_lower_case", to_lower_case),
    ("normalize_unicode", normalize_unicode)    
], )

## only run the pipeline on categorical data
updated_categories = default_cat_pipeline.fit_transform(X_train.select_dtypes(include=["object", "string"]))
updated_categories.head()

Unnamed: 0,gender,city,profession,sleep_duration,dietary_habits,degree,suicidal_thoughts,family_history
0,male,jaipur,student,'7-8 hours',moderate,'class 12',yes,no
1,male,vadodara,student,'7-8 hours',moderate,b.arch,no,yes
2,male,ahmedabad,student,'7-8 hours',unhealthy,m.ed,yes,yes
3,male,bhopal,student,'7-8 hours',moderate,b.com,yes,no
4,male,patna,student,'5-6 hours',unhealthy,b.com,no,no


### Preprocessing Gender Column

In [12]:
updated_categories["gender"].value_counts()

gender
male      12437
female     9883
Name: count, dtype: int64

* Since the data is split between just two genders, we can use `OneHotEncoder` to encode it.
* Let's also use this column to explore how we'll design our data transformation implementation. We'll need a combination of Pipelines and ColumnTransformers to make it efficient.

In [13]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


gender_pipeline = Pipeline([
    ("default_cat_pipeline", default_cat_pipeline),
    ("encode_gender", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
])

preprocessing_gender = ColumnTransformer([
    ("preprocessing_gender", gender_pipeline, ["gender"])
])

preprocessing_gender.fit_transform(X_train)
preprocessing_gender.get_feature_names_out()

array(['preprocessing_gender__gender_female',
       'preprocessing_gender__gender_male'], dtype=object)

* So this will be our plan going forward: we'll create column-specific pipelines, each combining the default pipeline with column-specific transformations, and finally combine all of them into one clean "preprocessing_pipeline" column transformer.
* For now we'll keep gender as it is, for easier data exploration in the next section.

### Preprocessing Profession Column

In [14]:
updated_categories["profession"].value_counts()

profession
student                     22294
architect                       6
teacher                         4
'digital marketer'              3
'content writer'                2
chef                            2
doctor                          2
pharmacist                      2
manager                         1
'educational consultant'        1
lawyer                          1
entrepreneur                    1
'civil engineer'                1
Name: count, dtype: int64

In [15]:
(updated_categories[updated_categories["profession"] != 'student'].shape[0] / updated_categories.shape[0] ) * 100

0.11648745519713263

* So the vast majority of rows are students; non-students make up only about 0.12% of the data.
* Although a column with this little variance will contribute very little to our classification, for now we'll create a pipeline that changes every non-student value to 'working' and then one-hot encodes the column.
* This assumption might be wrong for test data or in production, but we'll handle that when we see new information.
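
The collapsing step described above can be sketched on a toy Series (made-up values) with `Series.where`, which keeps values where the condition holds and replaces the rest. This is just one possible way to express it.

```python
import pandas as pd

profession = pd.Series(
    ["student", "architect", "student", "chef"], name="profession")

# Keep 'student' where the condition is True; everything else becomes 'working'.
collapsed = profession.where(profession == "student", "working")

# Share of non-student rows, in percent.
non_student_pct = (profession != "student").mean() * 100

print(collapsed.tolist())
print(non_student_pct)
```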

In [16]:
# function to map "profession" column values to 'student' and 'working'
def map_working_profession_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
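    # restrict the assignment to the profession column so other columns are never overwritten
    df_copy.loc[df_copy["profession"] != 'student', "profession"] = 'working'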
    return df_copy

map_working_profession = FunctionTransformer(map_working_profession_fn, feature_names_out="one-to-one")

profession_pipeline = Pipeline([
    ("default_cat_pipeline", default_cat_pipeline),
    ("map_profession", map_working_profession),
    ("encode_profession", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
])

preprocessing_profession = ColumnTransformer([
    ("preprocessing_profession", profession_pipeline, ["profession"])
])

# we can uncomment this if we need pandas as output
# preprocessing_profession.set_output(transform="pandas")

temp = preprocessing_profession.fit_transform(X_train)
preprocessing_profession.get_feature_names_out()


array(['preprocessing_profession__profession_student',
       'preprocessing_profession__profession_working'], dtype=object)

* Let's also update the train set for now, to help with data exploration.

In [17]:
updated_categories.loc[updated_categories["profession"] != 'student','profession'] = "working"

In [18]:
updated_categories["profession"].value_counts()

profession
student    22294
working       26
Name: count, dtype: int64

### Preprocessing Sleep Duration

In [19]:
updated_categories["sleep_duration"].value_counts()

sleep_duration
'less than 5 hours'    6646
'7-8 hours'            5871
'5-6 hours'            4963
'more than 8 hours'    4825
others                   15
Name: count, dtype: int64

In [20]:
(updated_categories[updated_categories["sleep_duration"] == 'others'].shape[0] / updated_categories.shape[0]) * 100

0.06720430107526883

* Only 15 rows (about 0.07%) are `others`, so I think it's safe to merge them into 'more than 8 hours'.
* Here is the plan:
    * We'll rename the values to `lt_5`, `bt_5_6`, `bt_7_8` and `gt_8`, which gives less verbose values and column names. We might find more values in test or production data, but for now we'll assume these are the only possible values.
    * We'll merge `others` into `gt_8` since the sample size is very small.
    * We'll either one-hot encode or ordinal encode the data, depending on the algorithm we want to use.
* One more thing we missed earlier: these strings are wrapped in single quotes, which we'll need to strip before any other processing.
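
The renaming and merging described above can also be expressed as a single dictionary lookup. The sketch below is one possible formulation on a toy Series; the constant name `SLEEP_DURATION_MAP` and the `'9 hours'` value are made up to show what happens to an unseen label.

```python
import pandas as pd

SLEEP_DURATION_MAP = {
    "less than 5 hours": "lt_5",
    "5-6 hours": "bt_5_6",
    "7-8 hours": "bt_7_8",
    "more than 8 hours": "gt_8",
    "others": "gt_8",  # merged into gt_8 because the sample is very small
}

raw = pd.Series([
    "'7-8 hours'", "'less than 5 hours'", "others", "'5-6 hours'", "'9 hours'",
])

# Strip the surrounding single quotes before looking anything up.
stripped = raw.str.strip("'")

# map() yields NaN for labels missing from the dictionary, so unseen labels
# are kept unchanged by filling from the stripped values.
mapped = stripped.map(SLEEP_DURATION_MAP).fillna(stripped)

print(mapped.tolist())
```

Leaving unseen labels unchanged means a downstream encoder configured with `handle_unknown` decides how to treat them.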


In [21]:
# helper function to rename and map the sleep duration category
from sklearn.preprocessing import OrdinalEncoder


def sleep_duration_cleanup_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
    df_copy["sleep_duration"] = df_copy["sleep_duration"].str.strip("'")
    return df_copy


sleep_duration_cleanup = FunctionTransformer(
    sleep_duration_cleanup_fn, feature_names_out="one-to-one")


def sleep_duration_mapping_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
    # map 'less than 5 hours' to lt_5
    df_copy.loc[df_copy["sleep_duration"] ==
                'less than 5 hours', 'sleep_duration'] = 'lt_5'
    # map '5-6 hours' to bt_5_6
    df_copy.loc[df_copy["sleep_duration"] ==
                '5-6 hours', 'sleep_duration'] = 'bt_5_6'
    # map '7-8 hours' to bt_7_8
    df_copy.loc[df_copy["sleep_duration"] ==
                '7-8 hours', 'sleep_duration'] = 'bt_7_8'
    # more than 8 hours to gt_8
    df_copy.loc[df_copy["sleep_duration"] ==
                'more than 8 hours', 'sleep_duration'] = 'gt_8'
    # map 'others' to gt_8
    df_copy.loc[df_copy["sleep_duration"] ==
                'others', 'sleep_duration'] = 'gt_8'
    return df_copy


sleep_duration_mapping = FunctionTransformer(
    sleep_duration_mapping_fn, feature_names_out="one-to-one")


def make_sleep_duration_pipeline_fn(encoding="onehot"):
    steps = [("default_cat_pipeline", default_cat_pipeline),
             ("sleep_duration_cleanup", sleep_duration_cleanup),
             ("sleep_duration_mapping", sleep_duration_mapping)]

    if encoding == "onehot":
        steps.append(("encoder", OneHotEncoder(
            sparse_output=False, handle_unknown="ignore")))
    elif encoding == "ordinal":
        steps.append(("encoder", OrdinalEncoder(categories=[[
            "lt_5", "bt_5_6", "bt_7_8", "gt_8"
        ]], handle_unknown="use_encoded_value", unknown_value=-1)))
    else:
        raise ValueError("Invalid encoding type: choose 'onehot' or 'ordinal'")
    return Pipeline(steps=steps)


preprocesing_sleep_duration = ColumnTransformer([(
    "sleep_duration_pipeline", make_sleep_duration_pipeline_fn(encoding="ordinal"), [
        "sleep_duration"]
)])

temp = preprocesing_sleep_duration.fit_transform(X_train)
preprocesing_sleep_duration.get_feature_names_out()

array(['sleep_duration_pipeline__sleep_duration'], dtype=object)

* Let's update our dataset with the cleaned categorical values for easier data exploration later.

In [22]:
updated_categories = sleep_duration_cleanup_fn(updated_categories)
updated_categories = sleep_duration_mapping_fn(updated_categories)
updated_categories

Unnamed: 0,gender,city,profession,sleep_duration,dietary_habits,degree,suicidal_thoughts,family_history
0,male,jaipur,student,bt_7_8,moderate,'class 12',yes,no
1,male,vadodara,student,bt_7_8,moderate,b.arch,no,yes
2,male,ahmedabad,student,bt_7_8,unhealthy,m.ed,yes,yes
3,male,bhopal,student,bt_7_8,moderate,b.com,yes,no
4,male,patna,student,bt_5_6,unhealthy,b.com,no,no
...,...,...,...,...,...,...,...,...
22315,male,kolkata,student,bt_7_8,unhealthy,b.com,yes,no
22316,female,patna,student,lt_5,unhealthy,msc,yes,yes
22317,male,lucknow,student,bt_7_8,healthy,b.arch,yes,yes
22318,female,kolkata,student,bt_5_6,unhealthy,md,yes,no


### Preprocessing Dietary Habits

In [23]:
updated_categories["dietary_habits"].value_counts()

dietary_habits
unhealthy    8265
moderate     7898
healthy      6149
others          8
Name: count, dtype: int64

* This looks straightforward; we can use a one-hot or ordinal encoder for this column.
* Here too we can merge `others` into `unhealthy` since the sample size is very small.
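
If ordinal encoding is chosen, the category order has to be given explicitly, otherwise the encoder sorts the labels alphabetically. The sketch below uses toy data to show the resulting codes, including what happens to an `others` value that reaches the encoder without being merged first.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    # Explicit order: unhealthy < moderate < healthy.
    categories=[["unhealthy", "moderate", "healthy"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

train = pd.DataFrame({"dietary_habits": ["healthy", "unhealthy", "moderate"]})
encoder.fit(train)

# 'others' is not in the category list, so it is encoded as -1.
new = pd.DataFrame({"dietary_habits": ["moderate", "others", "healthy"]})
codes = encoder.transform(new).ravel().tolist()

print(codes)
```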

In [24]:
# helper function to map the dietary habits 'others' value to 'unhealthy'
def dietary_habits_mapping_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")
    df_copy = df.copy()
    df_copy.loc[df_copy["dietary_habits"] == "others","dietary_habits"] = "unhealthy"
    return df_copy


dietary_habits_mapping = FunctionTransformer(
    dietary_habits_mapping_fn, feature_names_out="one-to-one")


def make_dietary_habits_pipeline_fn(encoding='onehot'):
    steps = [
        ("default_cat_pipeline", default_cat_pipeline),
        ("dietary_habits_mapping", dietary_habits_mapping)
    ]

    if encoding == "onehot":
        steps.append(("encoder", OneHotEncoder(
            sparse_output=False, handle_unknown="ignore")))
    elif encoding == "ordinal":
        steps.append(("encoder", OrdinalEncoder(categories=[[
            "unhealthy", "moderate", "healthy"
        ]], handle_unknown="use_encoded_value", unknown_value=-1)))
    else:
        raise ValueError("Invalid encoding type: choose 'onehot' or 'ordinal'")
    return Pipeline(steps=steps)

preprocesing_dietary_habits = ColumnTransformer([(
    "dietary_habits_pipeline", make_dietary_habits_pipeline_fn(encoding="onehot"), [
        "dietary_habits"]
)])

temp = preprocesing_dietary_habits.fit_transform(X_train)
preprocesing_dietary_habits.get_feature_names_out()

array(['dietary_habits_pipeline__dietary_habits_healthy',
       'dietary_habits_pipeline__dietary_habits_moderate',
       'dietary_habits_pipeline__dietary_habits_unhealthy'], dtype=object)

* Let's update the category values for exploration.

In [25]:
updated_categories = dietary_habits_mapping_fn(updated_categories)
updated_categories["dietary_habits"].value_counts()

dietary_habits
unhealthy    8273
moderate     7898
healthy      6149
Name: count, dtype: int64

### Preprocessing Degree 

In [26]:
updated_categories["degree"].value_counts()

degree
'class 12'    4808
b.ed          1487
b.com         1193
b.arch        1183
bca           1132
msc            968
b.tech         931
mca            830
m.tech         816
bhm            743
bsc            719
m.ed           672
b.pharm        654
m.com          590
bba            563
mbbs           562
llb            529
be             485
m.pharm        478
ba             477
md             473
mba            455
ma             445
phd            432
llm            380
me             143
mhm            142
others          30
Name: count, dtype: int64

* This is going to be a tricky one, but thanks to ChatGPT we have a dictionary mapping each degree to a field and a level.
* Here is the plan for the pipeline:
    * Step 1: Clean values that might have single quotes (`'`) around them.
    * Step 2: Create two new fields, `degree_field` and `degree_level`, and fill them from the dictionary mapping below. We'll also use these for data exploration.
    * Step 3: Create a one-hot encoder for `degree_field`.
    * Step 4: Create a generic encoder (one-hot or ordinal) for `degree_level` (unknown < high_school < bachelor < master < doctorate). Anything else is encoded as -1.
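
Steps 1, 2 and 4 can be sketched with plain Python before wiring them into a pipeline. The excerpt `DEGREE_MAP`, the helper `describe_degree` and the `diploma` input are assumptions for illustration only; the point is the lookup with an `unknown` fallback and the rank implied by the level order.

```python
# Small illustrative excerpt of a degree -> field/level mapping.
DEGREE_MAP = {
    "class 12": {"field": "school", "level": "high_school"},
    "b.com": {"field": "commerce", "level": "bachelor"},
    "m.ed": {"field": "education", "level": "master"},
    "phd": {"field": "research", "level": "doctorate"},
}

# Order used for ordinal encoding of the level.
LEVEL_ORDER = ["unknown", "high_school", "bachelor", "master", "doctorate"]


def describe_degree(raw):
    """Return (field, level, rank) for a raw degree string."""
    key = raw.strip("'")                      # Step 1: drop surrounding quotes
    entry = DEGREE_MAP.get(key, {})           # Step 2: look up, empty if unseen
    field = entry.get("field", "unknown")
    level = entry.get("level", "unknown")
    rank = LEVEL_ORDER.index(level)           # Step 4: position in the order
    return field, level, rank


print(describe_degree("'class 12'"))  # -> ('school', 'high_school', 1)
print(describe_degree("diploma"))     # -> ('unknown', 'unknown', 0)
```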

In [27]:
degree_mapping_dict = {
    "class 12":     {"field": "school",      "level": "high_school"},

    # Commerce & Business
    "b.com":        {"field": "commerce",    "level": "bachelor"},
    "m.com":        {"field": "commerce",    "level": "master"},
    "bba":          {"field": "business",    "level": "bachelor"},
    "mba":          {"field": "business",    "level": "master"},

    # Engineering & Tech
    "b.tech":       {"field": "engineering", "level": "bachelor"},
    "be":           {"field": "engineering", "level": "bachelor"},
    "b.arch":       {"field": "architecture","level": "bachelor"},
    "me":           {"field": "engineering", "level": "master"},
    "m.tech":       {"field": "engineering", "level": "master"},

    # Science & CS
    "bsc":          {"field": "science",     "level": "bachelor"},
    "msc":          {"field": "science",     "level": "master"},
    "bca":          {"field": "computer_app","level": "bachelor"},
    "mca":          {"field": "computer_app","level": "master"},

    # Education
    "b.ed":         {"field": "education",   "level": "bachelor"},
    "m.ed":         {"field": "education",   "level": "master"},

    # Medical
    "mbbs":         {"field": "medical",     "level": "bachelor"},
    "md":           {"field": "medical",     "level": "master"},  # Technically PG, but aligned here
    "b.pharm":      {"field": "pharmacy",    "level": "bachelor"},
    "m.pharm":      {"field": "pharmacy",    "level": "master"},

    # Law
    "llb":          {"field": "law",         "level": "bachelor"},
    "llm":          {"field": "law",         "level": "master"},

    # Hospitality
    "bhm":          {"field": "hospitality", "level": "bachelor"},
    "mhm":          {"field": "hospitality", "level": "master"},

    # Arts
    "ba":           {"field": "arts",        "level": "bachelor"},
    "ma":           {"field": "arts",        "level": "master"},

    # Research
    "phd":          {"field": "research",    "level": "doctorate"},

    # Other
    "others":       {"field": "unknown",     "level": "unknown"}
}

In [28]:
# helper function to clean up degree column
def degree_cleanup_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError(
            "degree_cleanup_fn : Input must be a pandas DataFrame")
    df_copy = df.copy()
    df_copy["degree"] = df_copy["degree"].str.strip("'")
    return df_copy


degree_cleanup = FunctionTransformer(
    degree_cleanup_fn, feature_names_out="one-to-one")


def degree_mapping_feature_names(function_transformer, feature_names_in):
    features_out = feature_names_in.tolist()
    features_out.extend(["degree_field", "degree_level"])
    return features_out

# helper functions to map degree to degree_field and degree_level
def map_field(val):
    return degree_mapping_dict.get(val, {}).get("field", "unknown")


def map_level(val):
    return degree_mapping_dict.get(val, {}).get("level", "unknown")


def degree_mapping_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError("degree_mapping_fn: Input must be a pandas DataFrame")
    df_copy = df.copy()
    df_copy["degree_field"] = df_copy["degree"].map(map_field)
    df_copy["degree_level"] = df_copy["degree"].map(map_level)
    return df_copy


degree_mapping = FunctionTransformer(
    degree_mapping_fn, feature_names_out=degree_mapping_feature_names)

# helper function to create a column transformer to encode degree_field and degree_level fields.


def make_degree_encoder(encoding="onehot"):
    degree_level_encoder = OneHotEncoder(
        handle_unknown="ignore", sparse_output=False)
    
    if encoding == "ordinal":
        degree_level_encoder = OrdinalEncoder(categories=[[
            "unknown", "high_school", "bachelor", "master", "doctorate"
        ]], handle_unknown="use_encoded_value", unknown_value=-1)

    return ColumnTransformer([
        ("encode_degree_field", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["degree_field"]),
        ("encode_degree_level", degree_level_encoder, ["degree_level"])
    ])


# testing basic pipeline
degree_pipeline = Pipeline([
    ("default_cat_pipeline", default_cat_pipeline),
    ("cleanup", degree_cleanup),
    ("mapping", degree_mapping),
    ("encode_degree",make_degree_encoder())
])

preprocessing_degree = ColumnTransformer([
    ("degree_pipeline", degree_pipeline, ["degree"])
])
temp = preprocessing_degree.fit_transform(X_train)
preprocessing_degree.get_feature_names_out()

array(['degree_pipeline__encode_degree_field__degree_field_architecture',
       'degree_pipeline__encode_degree_field__degree_field_arts',
       'degree_pipeline__encode_degree_field__degree_field_business',
       'degree_pipeline__encode_degree_field__degree_field_commerce',
       'degree_pipeline__encode_degree_field__degree_field_computer_app',
       'degree_pipeline__encode_degree_field__degree_field_education',
       'degree_pipeline__encode_degree_field__degree_field_engineering',
       'degree_pipeline__encode_degree_field__degree_field_hospitality',
       'degree_pipeline__encode_degree_field__degree_field_law',
       'degree_pipeline__encode_degree_field__degree_field_medical',
       'degree_pipeline__encode_degree_field__degree_field_pharmacy',
       'degree_pipeline__encode_degree_field__degree_field_research',
       'degree_pipeline__encode_degree_field__degree_field_school',
       'degree_pipeline__encode_degree_field__degree_field_science',
       'degree_pip

* Let's clean up the degree column and create the degree mapping columns for data exploration.

In [29]:
updated_categories = degree_cleanup_fn(updated_categories)
updated_categories = degree_mapping_fn(updated_categories)
updated_categories.head()

Unnamed: 0,gender,city,profession,sleep_duration,dietary_habits,degree,suicidal_thoughts,family_history,degree_field,degree_level
0,male,jaipur,student,bt_7_8,moderate,class 12,yes,no,school,high_school
1,male,vadodara,student,bt_7_8,moderate,b.arch,no,yes,architecture,bachelor
2,male,ahmedabad,student,bt_7_8,unhealthy,m.ed,yes,yes,education,master
3,male,bhopal,student,bt_7_8,moderate,b.com,yes,no,commerce,bachelor
4,male,patna,student,bt_5_6,unhealthy,b.com,no,no,commerce,bachelor


### Preprocessing Suicidal Thoughts Column

In [30]:
updated_categories["suicidal_thoughts"].value_counts()

suicidal_thoughts
yes    14133
no      8187
Name: count, dtype: int64

* This seems straightforward, and the data looks clean enough. We can simply one-hot encode this column.

In [31]:
suicidal_thoughts_pipeline = Pipeline([
    ("default_cat_pipeline", default_cat_pipeline),
    ("suicidal_thoughts_encoding", OneHotEncoder(
        handle_unknown="ignore", sparse_output=False))
])

preprocessing_suicidal_thoughts = ColumnTransformer(
    [(
        "suicidal_thoughts_pipeline", suicidal_thoughts_pipeline,["suicidal_thoughts"]
    )]
)

temp = preprocessing_suicidal_thoughts.fit_transform(X_train)
preprocessing_suicidal_thoughts.get_feature_names_out()

array(['suicidal_thoughts_pipeline__suicidal_thoughts_no',
       'suicidal_thoughts_pipeline__suicidal_thoughts_yes'], dtype=object)

### Preprocessing Family History

In [32]:
updated_categories["family_history"].value_counts()

family_history
no     11517
yes    10803
Name: count, dtype: int64

* This one also seems straightforward: a simple one-hot encoding should make this column ready for training.

In [33]:
family_history_pipeline = Pipeline([
    ("default_cat_pipeline", default_cat_pipeline),
    ("family_history_encoding", OneHotEncoder(
        handle_unknown="ignore", sparse_output=False))
])

preprocessing_family_history = ColumnTransformer(
    [(
        "family_history_pipeline", family_history_pipeline,["family_history"]
    )]
)

temp = preprocessing_family_history.fit_transform(X_train)
preprocessing_family_history.get_feature_names_out()

array(['family_history_pipeline__family_history_no',
       'family_history_pipeline__family_history_yes'], dtype=object)

### Preprocessing City Column

In [34]:
updated_categories["city"].value_counts()

city
kalyan                  1284
srinagar                1073
hyderabad               1063
vasai-virar             1042
lucknow                  943
thane                    910
kolkata                  890
agra                     864
ludhiana                 848
surat                    842
jaipur                   840
patna                    823
visakhapatnam            763
pune                     751
bhopal                   748
ahmedabad                748
chennai                  707
meerut                   660
rajkot                   633
bangalore                625
delhi                    602
ghaziabad                588
mumbai                   563
vadodara                 561
varanasi                 550
nagpur                   533
indore                   519
kanpur                   493
nashik                   452
faridabad                381
harsha                     2
bhavna                     2
saanvi                     2
city                       2
khaziabad

* So the city column needs some cleaning:
    * Some city names may contain stray single quotes (`'`) that need to be stripped.
    * Some values are obviously not cities but rather distances, person names and education degree names. To fix this we'll use a master city dataset with verified city names and lat/long info, which we'll add as additional columns.
    * We'll also add an `is_valid_city` flag, and for invalid city names we'll use `Nagpur` as the default name and its lat/long as the default coordinates, since it is considered to be the geographical center of India.
    * We don't need to encode the city itself; instead we'll use cluster similarity features. We **might** do that after some data exploration.
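
The validation idea above can be sketched as a left merge against the master list, using `indicator=True` to derive the flag. The miniature `master` frame and its rounded coordinates are illustrative stand-ins, not the real master dataset; only the Nagpur defaults are taken from this notebook.

```python
import pandas as pd

# Hypothetical miniature master list with rounded, illustrative coordinates.
master = pd.DataFrame({
    "name": ["jaipur", "patna"],
    "lat": [26.9, 25.6],
    "long": [75.8, 85.1],
})

# Default coordinates (Nagpur) for names that cannot be verified.
DEFAULT_LAT, DEFAULT_LONG = 21.122615, 79.041124

cities = pd.DataFrame({"city": ["jaipur", "harsha", "patna"]})

# Left merge keeps every input row; _merge says whether a match was found.
# drop_duplicates guards against repeated names multiplying rows.
merged = cities.merge(
    master.drop_duplicates("name"),
    how="left", left_on="city", right_on="name", indicator=True,
)
merged["is_valid_city"] = (merged["_merge"] == "both").astype(int)
merged["lat"] = merged["lat"].fillna(DEFAULT_LAT)
merged["long"] = merged["long"].fillna(DEFAULT_LONG)

result = merged[["city", "is_valid_city", "lat", "long"]]
print(result)
```

A merge does the lookup for all rows at once, which may matter with a master list of several hundred thousand names.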


In [36]:
## load the master city list
master_city_list = pd.read_csv(Path(data_dir,"detailed_in.csv"))


## convert the city names to lower case
# master_city_list["name"] = master_city_list["name"].str.strip().str.lower()
# master_city_list["ascii_name"] = master_city_list["ascii_name"].str.strip().str.lower()

# ## unicode normalization
# master_city_list["name"] = master_city_list["name"].map(lambda ct: unicodedata.normalize("NFKD",ct).encode("ascii","ignore").decode())
# master_city_list["ascii_name"] = master_city_list["ascii_name"].map(lambda ct: unicodedata.normalize("NFKD",ct).encode("ascii","ignore").decode())

# X_train["city"].map(lambda ct: unicodedata.normalize("NFKD",ct).encode("ascii","ignore").decode())
# master_cities = master_city_list["ascii_name"].str.strip().str.lower().to_list()
# master_cities[:5]
master_city_list.head()

Unnamed: 0,name,ascii_name,lat,long
0,#100 bed and breakfast,#100 bed and breakfast,12.98332,77.58427
1,10 calangute,10 calangute,15.54244,73.76279
2,100 feet hospital,100 feet hospital,19.38609,72.82558
3,12th avenue hotel,12th avenue hotel,12.97044,77.64617
4,1589 city mark hotel,1589 city mark hotel,28.46348,77.03176


In [37]:
master_city_list["ascii_name"].value_counts()

ascii_name
#100 bed and breakfast    1
narayanpur mardan         1
narayanpur majhari        1
narayanpur main canal     1
narayanpur mafi           1
                         ..
gyadal gondi              1
gya                       1
gwinai                    1
gwilani                   1
zuvvigunta                1
Name: count, Length: 407781, dtype: int64

In [110]:
# helper function to clean up city column and remove special characters
def city_cleanup_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError(
            "city_cleanup_fn : Input must be a pandas DataFrame")
    df_copy = df.copy()
    df_copy["city"] = df_copy["city"].str.strip("'")
    return df_copy

city_cleanup = FunctionTransformer(city_cleanup_fn, feature_names_out="one-to-one")

# helper function that maps city names to default values of is_valid_city = 0, lat/long of Nagpur
def city_mapping_feature_names(function_transformer, feature_names_in):
    features_out = feature_names_in.tolist()
    features_out.extend(["is_valid_city","lat", "long"])
    return features_out

def city_mapping_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError(
            "city_mapping_fn : Input must be a pandas DataFrame")
    df_copy = df.copy()
    df_copy["is_valid_city"] = 0
    df_copy["lat"] = 21.122615
    df_copy["long"] = 79.041124
    return df_copy

city_mapping = FunctionTransformer(city_mapping_fn, feature_names_out=city_mapping_feature_names)

# helper function that compares and verifies the city name and if its a valid city then updates the lat/long value
def map_city_data(city_name):
    ## search for city name in master city list
    city_data = master_city_list.loc[master_city_list["name"] == city_name]
    ## if city exists then return valid info
    if city_data.shape[0] > 0:
        return (1, city_data["lat"].values[0],city_data["long"].values[0])
    ## if city doesn't exist then mark it as invalid and return default info
    return 0,21.122615,79.041124
    

def city_verification_fn(df):
    if not isinstance(df, pd.DataFrame):
        raise ValueError(
            "city_verification_fn : Input must be a pandas DataFrame")
    df_copy = df.copy()
    unique_cities = df_copy["city"].unique()
    for unique_city in unique_cities:
        is_valid_city,lat,long = map_city_data(unique_city)
        df_copy.loc[df_copy["city"] == unique_city, "is_valid_city"] = is_valid_city
        df_copy.loc[df_copy["city"] == unique_city, "lat"] = lat
        df_copy.loc[df_copy["city"] == unique_city, "long"] = long
    return df_copy

city_verification = FunctionTransformer(city_verification_fn, feature_names_out="one-to-one")

# function to fuzzy match invalid cities to master city list.
# from rapidfuzz import process, fuzz,utils

# unique_master_cites = master_city_list["name"].values
# def map_fuzzy_matched_city(city_name):
#     matched_list = process.extract(city_name, unique_master_cites, scorer=fuzz.QRatio,limit=1)
#     match,score,_ = matched_list[0]
#     print(city_name, match,score)

# def fuzzy_city_mapper_fn(df):
#     if not isinstance(df, pd.DataFrame):
#         raise ValueError(
#             "degree_cleanup_fn : Input must be a pandas DataFrame")
#     df_copy = df.copy()
#     ## only focuses on invalid city
#     unique_cities = df_copy.loc[df_copy["is_valid_city"] == 0,"city"].unique()
#     for unique_city in unique_cities:
#         map_fuzzy_matched_city(unique_city)
#     return df_copy

# fuzzy_city_mapping = FunctionTransformer(fuzzy_city_mapper_fn, feature_names_out="one-to-one")

In [111]:
city_pipeline = Pipeline([
    ("default_cat_pipeline", default_cat_pipeline),
    ("city_cleanup", city_cleanup),
    ("city_mapping", city_mapping),
    ("city_verification", city_verification),
    # ("fuzzy_city_mapping", fuzzy_city_mapping)
])

preprocessing_city = ColumnTransformer([
    ("city_pipeline", city_pipeline, ["city"])
])


temp = preprocessing_city.fit_transform(X_train)

In [112]:
temp_df = pd.DataFrame(temp, columns=preprocessing_city.get_feature_names_out())
temp_df["city_pipeline__is_valid_city"].value_counts()


city_pipeline__is_valid_city
1    20634
0     1686
Name: count, dtype: int64

In [113]:
(temp_df.loc[temp_df["city_pipeline__is_valid_city"] == 0].shape[0]/temp_df.loc[temp_df["city_pipeline__is_valid_city"] == 1].shape[0])*100

8.170979936027916

* So 1,686 of 22,320 rows (about 7.6%) still have invalid cities; the 8.17% computed above is the ratio of invalid to valid rows. For now we've decided to keep them as they are and experiment with ML models to see how this affects training.
* Invalid cities are marked by the `is_valid_city` flag, and their lat/long default to Nagpur.
* The next step is to explore the non-categorical data.
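
If the invalid cities turn out to matter, near-miss spellings could be recovered with fuzzy matching. The sketch below uses the standard library's `difflib` as a stand-in for the `rapidfuzz` approach left commented out earlier; the helper name `suggest_city`, the short candidate list and the `0.8` cutoff are assumptions that would need tuning against the real master list.

```python
from difflib import get_close_matches

# Tiny illustrative candidate list; the real one would be the master city names.
known_cities = ["ghaziabad", "faridabad", "hyderabad", "ahmedabad"]


def suggest_city(name, candidates, cutoff=0.8):
    """Return the closest candidate above the cutoff, or None if there is none."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest_city("khaziabad", known_cities))  # -> ghaziabad
print(suggest_city("harsha", known_cities))     # -> None
```

A fairly high cutoff keeps person names and other non-city values from being mapped to an arbitrary city.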