## Importing general libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Importing dataset

### Let's understand the dataset. Why?
To understand what action needs to be taken. For example, if the dataset has null values or other gibberish values, we have to handle them, and if the dataset is too small, we will have to combine it with other similar data for better results. You get the idea, right?

In [2]:
df = pd.read_csv('../ML Datasets.csv')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


Let's see how many rows of data we originally have. \
To print the number of rows in df, either `df.shape[0]` or `len(df)` can be used.

In [3]:
print(f"Number of rows/datasets in dataframe : {len(df)}")

Number of rows/datasets in dataframe : 7043
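As a quick illustration of why the two options agree, here is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are assumptions for illustration, not the real dataset). `shape` returns a `(rows, columns)` tuple, so it also gives the column count in the same call.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({"tenure": [1, 34, 2], "Churn": ["No", "No", "Yes"]})

# shape is a (rows, columns) tuple; len() returns only the row count.
n_rows, n_cols = toy.shape
print(f"rows={n_rows}, columns={n_cols}")
print(len(toy) == toy.shape[0])
```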


Let's start by checking whether this dataset has missing values to fill in. \
Let's see how many null values there are in each column.

In [4]:
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

No null values in any column. \
So, let's see how many unique categories there are in each column that has object as its datatype.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


18 columns have object datatype. So, let's print the number of categories in each of these columns.

In [6]:
for each_dtype, each_column in zip(df.dtypes, df.columns):
    if (each_dtype == 'object'):
        print(f"{each_column} has total of {len(df[each_column].unique())} types of category")

customerID has total of 7043 types of category
gender has total of 2 types of category
Partner has total of 2 types of category
Dependents has total of 2 types of category
PhoneService has total of 2 types of category
MultipleLines has total of 3 types of category
InternetService has total of 3 types of category
OnlineSecurity has total of 3 types of category
OnlineBackup has total of 3 types of category
DeviceProtection has total of 3 types of category
TechSupport has total of 3 types of category
StreamingTV has total of 3 types of category
StreamingMovies has total of 3 types of category
Contract has total of 3 types of category
PaperlessBilling has total of 2 types of category
PaymentMethod has total of 4 types of category
TotalCharges has total of 6531 types of category
Churn has total of 2 types of category
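The same counts can probably be obtained without an explicit loop. The sketch below runs on a made-up toy frame (an assumption for illustration) and selects the non-numeric columns with `select_dtypes` before calling `nunique`. One difference to keep in mind: `nunique` ignores NaN by default, while `len(series.unique())` counts NaN as a value.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "tenure": [1, 34, 2],
})

# Keep only the non-numeric columns, then count distinct values per column.
non_numeric = toy.select_dtypes(exclude="number")
category_counts = non_numeric.nunique()
print(category_counts)
```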


Looking at this output, we can see that most of the columns have only a few categories, which is how it should be. \
Two more things stand out: we don't really need the `customerID` column for analysis or model building, and \
the `TotalCharges` column should be of type float (numeric), but it is of object type, which should be changed.

We will try to detect and remove outliers from the dataset, if any are found.
After that, we will have to transform the columns with categorical/object dtype to numeric using encoders.
Beyond that, we can normalize the values to the range 0 to 1, or standardize them using the mean and standard deviation.
And much more, depending on mood..
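To make the difference between normalizing and standardizing concrete, here is a minimal pandas-only sketch. It uses the five `MonthlyCharges` values visible in `df.head()` above; the variable names are assumptions, and this is not necessarily how the scaling will be done later in the project.

```python
import pandas as pd

# The five MonthlyCharges values shown in df.head().
charges = pd.Series([29.85, 56.95, 53.85, 42.30, 70.70], name="MonthlyCharges")

# Min-max normalization: the smallest value maps to 0 and the largest to 1.
normalized = (charges - charges.min()) / (charges.max() - charges.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# ddof=0 uses the population standard deviation.
standardized = (charges - charges.mean()) / charges.std(ddof=0)

print(normalized.round(3).tolist())
print(standardized.round(3).tolist())
```

After normalization every value lies in the range 0 to 1; after standardization the column has mean 0 and standard deviation 1.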


### Removing unwanted columns and fixing dtype errors.

In [7]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


Removing `customerID` column

In [8]:
df.drop(columns='customerID', inplace=True)
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


Changing `TotalCharges` column dtype

In [9]:
print(f"Current dtype : {df.TotalCharges.dtype}")
df = df.astype({
    'TotalCharges' : 'float32',
})
print(f"After dtype :  {df.TotalCharges.dtype}")

Current dtype : object


ValueError: could not convert string to float: ' ': Error while type casting for column 'TotalCharges'

Here, while converting the values of TotalCharges, we encountered an error stating that there is a blank string (`' '`) in our dataset, \
so we need to fix that, either by filling the values directly or by using an imputer. We will go with an imputer.
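As an aside, pandas also offers `pd.to_numeric` with `errors="coerce"`, which converts a column and turns unparseable entries into NaN in a single step. This is an alternative to the approach taken in this notebook, shown here on a made-up toy Series (an assumption for illustration).

```python
import pandas as pd

# Toy column with one blank string, mimicking the problem in TotalCharges.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
print(converted.isna().sum())
```

The coerced NaN values would still need to be filled or dropped afterwards, just like in the steps that follow.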

In [10]:
for each in df.columns:
    count = 0
    for each_row in df[each]:
        if each_row == ' ':
            count += 1
    print(f"Number of empty values in column {each} is {count}.")

Number of empty values in column gender is 0.
Number of empty values in column SeniorCitizen is 0.
Number of empty values in column Partner is 0.
Number of empty values in column Dependents is 0.
Number of empty values in column tenure is 0.
Number of empty values in column PhoneService is 0.
Number of empty values in column MultipleLines is 0.
Number of empty values in column InternetService is 0.
Number of empty values in column OnlineSecurity is 0.
Number of empty values in column OnlineBackup is 0.
Number of empty values in column DeviceProtection is 0.
Number of empty values in column TechSupport is 0.
Number of empty values in column StreamingTV is 0.
Number of empty values in column StreamingMovies is 0.
Number of empty values in column Contract is 0.
Number of empty values in column PaperlessBilling is 0.
Number of empty values in column PaymentMethod is 0.
Number of empty values in column MonthlyCharges is 0.
Number of empty values in column TotalCharges is 11.
Number of empty
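The nested loop above can likely be replaced by a vectorized comparison. The sketch below runs on a made-up toy frame (an assumption for illustration): comparing the whole frame to `' '` gives a boolean frame, and summing it counts the matches per column.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "tenure": [0, 34, 2],
    "TotalCharges": [" ", "1889.5", "108.15"],
})

# Element-wise comparison yields True where a cell is exactly ' ';
# sum() then counts the True values in each column.
blank_counts = (toy == " ").sum()
print(blank_counts)
```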

We will change every ' ' value in this column to null. We don't need to iterate and do it manually; instead, we will use the `replace` function.

In [11]:
df.loc[df.TotalCharges == ' ' ,'TotalCharges']

488      
753      
936      
1082     
1340     
3331     
3826     
4380     
5218     
6670     
6754     
Name: TotalCharges, dtype: object

In [12]:
df["TotalCharges"] = df["TotalCharges"].replace(' ', np.nan)

In [13]:
for each in df.columns:
    count = 0
    for each_row in df[each]:
        if each_row == " ":
            count += 1
    print(f"Number of empty values in column {each} is {count}.")

Number of empty values in column gender is 0.
Number of empty values in column SeniorCitizen is 0.
Number of empty values in column Partner is 0.
Number of empty values in column Dependents is 0.
Number of empty values in column tenure is 0.
Number of empty values in column PhoneService is 0.
Number of empty values in column MultipleLines is 0.
Number of empty values in column InternetService is 0.
Number of empty values in column OnlineSecurity is 0.
Number of empty values in column OnlineBackup is 0.
Number of empty values in column DeviceProtection is 0.
Number of empty values in column TechSupport is 0.
Number of empty values in column StreamingTV is 0.
Number of empty values in column StreamingMovies is 0.
Number of empty values in column Contract is 0.
Number of empty values in column PaperlessBilling is 0.
Number of empty values in column PaymentMethod is 0.
Number of empty values in column MonthlyCharges is 0.
Number of empty values in column TotalCharges is 0.
Number of empty 

Now we can change all values to float in column TotalCharges.

In [14]:
df = df.astype({
    'TotalCharges' : 'float',
})

print(f"After dtype : {df.TotalCharges.dtype}")

After dtype : float64


In [15]:
df.TotalCharges.info()

<class 'pandas.core.series.Series'>
RangeIndex: 7043 entries, 0 to 7042
Series name: TotalCharges
Non-Null Count  Dtype  
--------------  -----  
7032 non-null   float64
dtypes: float64(1)
memory usage: 55.2 KB


Let's fill these empty values using sklearn imputer.

In [16]:
df.TotalCharges.isnull().sum()

np.int64(11)

### Changing NaN values in column `TotalCharges` to the mean value using SimpleImputer

In [17]:
from sklearn.impute import SimpleImputer 

In [18]:
df.TotalCharges.isna().sum()


np.int64(11)

Let's see what a row with a missing value (NaN) looks like.

In [19]:
df.iloc[488]

gender                                 Female
SeniorCitizen                               0
Partner                                   Yes
Dependents                                Yes
tenure                                      0
PhoneService                               No
MultipleLines                No phone service
InternetService                           DSL
OnlineSecurity                            Yes
OnlineBackup                               No
DeviceProtection                          Yes
TechSupport                               Yes
StreamingTV                               Yes
StreamingMovies                            No
Contract                             Two year
PaperlessBilling                          Yes
PaymentMethod       Bank transfer (automatic)
MonthlyCharges                          52.55
TotalCharges                              NaN
Churn                                      No
Name: 488, dtype: object

Let's pass a NumPy array of the `TotalCharges` column to the `fit_transform` method of SimpleImputer. \
Make sure that `missing_values=np.nan` (it should be exactly the value you want to replace).

In [20]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# SimpleImputer expects a 2D array, so reshape the column to (n_rows, 1)
reshape_column = df.TotalCharges.to_numpy()
reshape_column = reshape_column.reshape(-1, 1)
print(reshape_column.shape)
# Flatten the (n_rows, 1) result back to 1D before assigning it to the column
df['TotalCharges'] = imputer.fit_transform(reshape_column).ravel()

(7043, 1)


In [21]:
df.TotalCharges.isna().sum()

np.int64(0)

Let's check the row at index 488 again. \
Its `TotalCharges` has changed to the mean value.

In [22]:
df.iloc[488]

gender                                 Female
SeniorCitizen                               0
Partner                                   Yes
Dependents                                Yes
tenure                                      0
PhoneService                               No
MultipleLines                No phone service
InternetService                           DSL
OnlineSecurity                            Yes
OnlineBackup                               No
DeviceProtection                          Yes
TechSupport                               Yes
StreamingTV                               Yes
StreamingMovies                            No
Contract                             Two year
PaperlessBilling                          Yes
PaymentMethod       Bank transfer (automatic)
MonthlyCharges                          52.55
TotalCharges                      2283.300441
Churn                                      No
Name: 488, dtype: object
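For a single column, a mean fill can also be sketched with pandas alone. The toy frame below is made up (an assumption for illustration). It also prints the rows with missing values first: the row at index 488 has a `tenure` of 0, so inspecting the missing rows before settling on the mean may be worthwhile.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "tenure": [1, 34, 0, 2],
    "TotalCharges": [29.85, 1889.5, np.nan, 108.15],
})

# Look at the rows that are missing before deciding how to fill them.
missing_rows = toy[toy["TotalCharges"].isna()]
print(missing_rows)

# Series.mean() skips NaN by default, so this is the mean of the observed values.
mean_value = toy["TotalCharges"].mean()
filled = toy["TotalCharges"].fillna(mean_value)
print(filled)
```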

### Converting categorical columns to numeric. Why?
So that a machine learning model can learn the patterns between different columns, since models work with numbers and don't understand strings (our language).
But before doing this we will do EDA, because after converting all categorical values to numeric, EDA becomes tricky.
So let's save the df and do EDA on it.
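To give a rough idea of what such a conversion looks like, here is a pandas-only sketch on a made-up toy frame (an assumption for illustration; the actual encoding is done later and may use a different tool). It shows integer codes for a two-category column and one-hot columns for a multi-category one.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Label-style encoding: each distinct category gets an integer code.
# Categories are sorted alphabetically, so "No" becomes 0 and "Yes" becomes 1.
toy["Partner_code"] = toy["Partner"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
contract_dummies = pd.get_dummies(toy["Contract"], prefix="Contract")

print(toy)
print(contract_dummies)
```

Integer codes imply an order between categories, which is harmless for two-valued columns but can mislead some models for columns such as `PaymentMethod`; one-hot columns avoid that at the cost of more columns.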

#### Saving df

In [35]:
df.head(1)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No


In [38]:
df.to_csv('../Data/ML Dataset cleaned.csv')

Now the dataset is saved in the Data folder in the parent directory. We will perform label encoding in the EDA notebook itself.
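One detail to be aware of when reloading the file: by default `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column on `read_csv`. The sketch below demonstrates this on a made-up toy frame, using in-memory buffers instead of real files (both are assumptions for illustration). Passing `index=False` is one way to avoid the extra column.

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned dataset (values are made up).
toy = pd.DataFrame({"gender": ["Female", "Male"], "tenure": [1, 34]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

If the file has already been written with the index, reading it with `pd.read_csv(path, index_col=0)` restores the index instead of creating a new column.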