# CS3802--Machine Learning Algorithms Lab

Adithya V | BTech CSE (IoT) - A | 21011102009

## Exercise 3
---
### Use the dataset, perform the necessary pre-processing, and build a logistic regression model. Divide the data in a 70:30 ratio and print the performance metrics.

### Importing the necessary libraries 

In [33]:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score
import math
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Reading the dataset

In [2]:
data = pd.read_csv('telecom_customer_churn.csv')
data.head()

Unnamed: 0,Customer ID,Gender,Age,Married,Number of Dependents,City,Zip Code,Latitude,Longitude,Number of Referrals,...,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Customer Status,Churn Category,Churn Reason
0,0002-ORFBO,Female,37,Yes,0,Frazier Park,93225,34.827662,-118.999073,2,...,Credit Card,65.6,593.3,0.0,0,381.51,974.81,1,,
1,0003-MKNFE,Male,46,No,0,Glendale,91206,34.162515,-118.203869,0,...,Credit Card,-4.0,542.4,38.33,10,96.21,610.28,1,,
2,0004-TLHLJ,Male,50,No,0,Costa Mesa,92627,33.645672,-117.922613,0,...,Bank Withdrawal,73.9,280.85,0.0,0,134.6,415.45,0,Competitor,Competitor had better devices
3,0011-IGKFF,Male,78,Yes,0,Martinez,94553,38.014457,-122.115432,1,...,Bank Withdrawal,98.0,1237.85,0.0,0,361.66,1599.51,0,Dissatisfaction,Product dissatisfaction
4,0013-EXCHZ,Female,75,Yes,0,Camarillo,93010,34.227846,-119.079903,3,...,Credit Card,83.9,267.4,0.0,0,22.14,289.54,0,Dissatisfaction,Network reliability


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6589 entries, 0 to 6588
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        6589 non-null   object 
 1   Gender                             6589 non-null   object 
 2   Age                                6589 non-null   int64  
 3   Married                            6589 non-null   object 
 4   Number of Dependents               6589 non-null   int64  
 5   City                               6589 non-null   object 
 6   Zip Code                           6589 non-null   int64  
 7   Latitude                           6589 non-null   float64
 8   Longitude                          6589 non-null   float64
 9   Number of Referrals                6589 non-null   int64  
 10  Tenure in Months                   6589 non-null   int64  
 11  Offer                              6589 non-null   object 

### Pre-Processing

#### Remove outliers using IQR method

The `IQR_Removal` function takes a DataFrame (`df`) as input and performs the following steps:

1. **Initialization:**
   - Retrieves column names from the DataFrame (`columns = df.columns`).

2. **Iterate Through Columns:**
   - For each column (`col`) in the DataFrame:
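      - Skips the 'SalePrice' column. The columns shown for this dataset do not include one, so the check appears to be a leftover from another exercise and has no effect here. The numeric target `Customer Status` is not skipped, so it goes through the same filter.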

3. **Check Column Type:**
   - Verifies if the column is not of type 'object' (i.e., numerical).

4. **Remove Outliers Using IQR:**
   - Calculates the first quartile (Q1), third quartile (Q3), and Interquartile Range (IQR) for the numerical column.
   - Filters rows to keep only those within the range of `(Q1 - 1.5 * IQR)` to `(Q3 + 1.5 * IQR)`.

5. **Return Updated DataFrame:**
   - Returns the DataFrame with outliers removed from numerical columns.

In [4]:
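def IQR_Removal(df):
    # Drop rows that fall outside the 1.5 * IQR fences of any numeric column.
    columns = df.columns
    for col in columns:
        # Skip 'SalePrice' if such a column is present.
        if col == 'SalePrice':
            continue
        typeCol = str(df[col].dtype)
        if typeCol != 'object':
            # Quartiles are computed on the rows that survived earlier columns.
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            iqr = Q3 - Q1
            df = df[(df[col] >= Q1 - 1.5*iqr) & (df[col] <= Q3 + 1.5*iqr)]
    return df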
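
As a small worked illustration of the fences described above, the sketch below applies the same rule to a single toy column. The helper name `iqr_bounds` and the values are made up for illustration and are not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy column: 500 is an obvious outlier.
toy = pd.DataFrame({"charge": [10, 12, 13, 15, 16, 18, 500]})

low, high = iqr_bounds(toy["charge"])
kept = toy[toy["charge"].between(low, high)]  # inclusive on both ends

print(low, high)               # 5.75 23.75
print(kept["charge"].tolist()) # [10, 12, 13, 15, 16, 18]
```

Here Q1 = 12.5 and Q3 = 17.0, so the IQR is 4.5 and only the value 500 falls outside the fences. Note that `IQR_Removal` reassigns `df` inside the loop, so each later column's quartiles are computed on rows that already passed the earlier columns.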

#### Remove columns with only one unique value and columns with null ratio >= 0.30

The `ThresholdandND_columnRemoval` function takes a DataFrame (`df`) as input and performs the following steps:

1. **Calculate Length and Columns:**
   - Retrieves the length of the DataFrame (`N = len(df)`) and column names (`columns = df.columns`).

2. **Iterate Through Columns:**
   - For each column (`col`) in the DataFrame:
      - Checks if the number of unique values in the column is equal to 1. If true, drops the column as it lacks diversity.

3. **Check Null Ratio:**
   - Counts the null values in the column (`notnull = df[col].isnull().sum()`; despite its name, this variable holds the number of nulls), divides the count by `N`, and checks whether the resulting ratio is greater than or equal to 0.30.
   - If true, drops the column as it exceeds the specified null ratio threshold.

4. **Return Updated DataFrame:**
   - Returns the DataFrame with columns removed based on the threshold and no-diversity criteria.


In [5]:
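def ThresholdandND_columnRemoval(df):
    # Drop constant columns and columns whose null ratio is >= 0.30.
    N = len(df)
    columns = df.columns
    for col in columns:
        # A single unique value carries no information for the model.
        if (len(df[col].unique()) == 1):
            df = df.drop([col], axis=1)
            continue
        # Despite its name, `notnull` holds the number of null values.
        notnull = df[col].isnull().sum()
        ratio = notnull / N
        if (ratio >= 0.30):
            df = df.drop([col], axis=1)
    return df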
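
The same two criteria can also be expressed without an explicit loop. The following is only a sketch on made-up data: the function name `drop_uninformative_columns` and the toy columns are assumptions, not part of the notebook.

```python
import pandas as pd

def drop_uninformative_columns(df: pd.DataFrame, max_null_ratio: float = 0.30) -> pd.DataFrame:
    """Drop columns with one unique value or a null ratio >= max_null_ratio."""
    # dropna=False counts NaN as a value, like len(df[col].unique()) does.
    constant = df.nunique(dropna=False) == 1
    # isna().mean() is the fraction of null entries per column.
    too_sparse = df.isna().mean() >= max_null_ratio
    return df.loc[:, ~(constant | too_sparse)]

# Hypothetical toy frame.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "US"],               # single unique value
    "churn_reason": ["price", None, None, "other"],    # 50% null
    "age": [30, 41, 52, 63],                           # kept
})

remaining = list(drop_uninformative_columns(toy).columns)
print(remaining)  # ['age']
```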


#### Handle null values by either removing rows or filling with mean/median

The `Handling_NullValues` function takes a DataFrame (`df`) as input and performs the following steps:

1. **Iterate Through Columns:**
   - For each column (`col`) in the DataFrame:
      - Checks the data type of the column (`typeCol = str(df[col].dtype)`).

2. **Handle Null Values for Object Type:**
   - If the column type is 'object' (categorical):
      - Removes rows with null values for that column (`df = df[df[col].notna()]`).

3. **Handle Null Values for Numeric Type:**
   - If the column type is numeric:
      - Calculates mean, median, and standard deviation of the column (`mean = df[col].mean()`, `median = df[col].median()`, `standard_deviation = df[col].std()`).

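4. **Partial Median Change (PMC) Criteria:**
   - Calculates what the notebook calls Partial Median Change (PMC) using the formula `pmc = (3 * (mean - median)) / standard_deviation`. This is the same expression as Pearson's second (median) skewness coefficient, so its magnitude measures how skewed the column is.
   - If PMC is greater than or equal to 0.4 or less than or equal to -0.4 (a noticeably skewed column):
      - Fills null values with the median (`df[col] = df[col].fillna(median)`), which is less sensitive to extreme values.
   - Otherwise:
      - Fills null values with the mean (`df[col] = df[col].fillna(mean)`).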

5. **Return Updated DataFrame:**
   - Returns the DataFrame with missing values handled based on data type and PMC criteria.



In [6]:
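def Handling_NullValues(df):
    # Work on a copy so the caller's DataFrame is not modified in place.
    df = df.copy()
    columns = df.columns
    for col in columns:
        typeCol = str(df[col].dtype)
        if typeCol == 'object':
            # Categorical column: drop the rows where it is missing.
            df = df[df[col].notna()].copy()
        else:
            mean = df[col].mean()
            median = df[col].median()
            standard_deviation = df[col].std()
            # 3 * (mean - median) / std measures how skewed the column is.
            pmc = (3 * (mean - median)) / standard_deviation
            if pmc >= 0.4 or pmc <= -0.4:
                # Skewed column: the median is the more robust fill value.
                df[col] = df[col].fillna(median)
            else:
                df[col] = df[col].fillna(mean)
    return df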
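
To make the mean-versus-median rule concrete, here is a minimal sketch on two toy Series, one symmetric and one skewed. The helper names and the values are hypothetical; only the formula and the 0.4 threshold come from the function above.

```python
import numpy as np
import pandas as pd

def median_skewness(s: pd.Series) -> float:
    """3 * (mean - median) / std, the quantity the notebook calls `pmc`."""
    return 3 * (s.mean() - s.median()) / s.std()

def fill_by_skewness(s: pd.Series, threshold: float = 0.4) -> pd.Series:
    """Fill nulls with the median if |skewness| >= threshold, else the mean."""
    fill_value = s.median() if abs(median_skewness(s)) >= threshold else s.mean()
    return s.fillna(fill_value)

# Hypothetical toy columns, each with one missing value at the end.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 100.0, np.nan])

filled_symmetric = fill_by_skewness(symmetric)
filled_skewed = fill_by_skewness(skewed)

print(filled_symmetric.iloc[-1])  # 3.0 (mean; skewness is 0)
print(filled_skewed.iloc[-1])     # 2.0 (median; skewness is about 1.31)
```

In the skewed column the mean (21.2) is pulled far from the bulk of the data by the single value 100, which is why the median is the safer fill value there.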

#### Perform one-hot encoding for categorical columns

The `OneHotEncoding_objects` function encodes categorical (object-type) columns using one-hot encoding:

1. **Iterate Through Columns:**
   - For each column (`col`) in the DataFrame:
      - Check if the column type is 'object'.

2. **One-Hot Encode Object Columns:**
   - If the column is 'object':
      - Use `pd.get_dummies` to create one-hot encoded columns.

3. **Rename and Join Encoded Columns:**
   - Rename the new columns by prepending the original column name as a prefix, with no separator (for example, `ContractOne Year`).
   - Join the one-hot encoded columns to the original DataFrame.

4. **Drop Original Object Column:**
   - Drop the original object-type column.

5. **Return Updated DataFrame:**
   - Returns the DataFrame with one-hot encoded object-type columns.

In [7]:
def OneHotEncoding_objects(df):
    # Replace every object-type column with one indicator column per category.
    columns = df.columns
    for col in columns:
        typeCol = str(df[col].dtype)
        if typeCol == 'object':
            enc = pd.get_dummies(df[col])
            encCol = enc.columns
            # Prefix each category with the source column name (no separator).
            newColumns = {}
            for i in range(0, len(encCol)):
                newColumns[encCol[i]] = col + encCol[i]
            enc.rename(columns=newColumns, inplace=True)
            df = df.join(enc)
            df = df.drop([col], axis=1)
    return df
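
For comparison, `pd.get_dummies` can handle the prefixing and the dropping of the original column itself. The sketch below uses a made-up two-column frame and an underscore separator, so its column names differ from the ones the notebook produces.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Contract": ["One Year", "Two Year", "One Year"],
    "Age": [37, 50, 78],
})

# `columns` selects what to encode; the original column is dropped automatically.
# dtype=int yields 0/1 instead of the boolean default of newer pandas versions.
encoded = pd.get_dummies(toy, columns=["Contract"], prefix_sep="_", dtype=int)

print(encoded.columns.tolist())
# ['Age', 'Contract_One Year', 'Contract_Two Year']
print(encoded["Contract_One Year"].tolist())
# [1, 0, 1]
```

Passing `drop_first=True` would drop one indicator per category, which is one way to avoid perfectly collinear columns in a linear model.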

### Result of Pre-Processing

In [8]:
df = OneHotEncoding_objects(IQR_Removal(Handling_NullValues(ThresholdandND_columnRemoval(data))))
df.head()

Unnamed: 0,Age,Number of Dependents,Zip Code,Latitude,Longitude,Number of Referrals,Tenure in Months,Avg Monthly Long Distance Charges,Avg Monthly GB Download,Monthly Charge,...,Unlimited DataNo,Unlimited DataYes,ContractMonth-to-Month,ContractOne Year,ContractTwo Year,Paperless BillingNo,Paperless BillingYes,Payment MethodBank Withdrawal,Payment MethodCredit Card,Payment MethodMailed Check
0,37,0,93225,34.827662,-118.999073,2,9,42.39,16.0,65.6,...,0,1,0,1,0,0,1,0,1,0
2,50,0,92627,33.645672,-117.922613,0,4,33.65,30.0,73.9,...,0,1,1,0,0,0,1,1,0,0
3,78,0,94553,38.014457,-122.115432,1,13,27.82,4.0,98.0,...,0,1,1,0,0,0,1,1,0,0
6,67,0,93437,34.757477,-120.550507,1,71,9.96,14.0,109.7,...,0,1,0,0,1,0,1,1,0,0
8,68,0,93063,34.296813,-118.685703,0,7,10.53,21.0,48.2,...,0,1,0,0,1,0,1,1,0,0


### Robust Scaling

In [12]:
scaler = RobustScaler()
cols = df.columns
data_scale = scaler.fit_transform(df.to_numpy())
data_scale = pd.DataFrame(data_scale, columns=cols)
data_scale

Unnamed: 0,Age,Number of Dependents,Zip Code,Latitude,Longitude,Number of Referrals,Tenure in Months,Avg Monthly Long Distance Charges,Avg Monthly GB Download,Monthly Charge,...,Unlimited DataNo,Unlimited DataYes,ContractMonth-to-Month,ContractOne Year,ContractTwo Year,Paperless BillingNo,Paperless BillingYes,Payment MethodBank Withdrawal,Payment MethodCredit Card,Payment MethodMailed Check
0,-0.518519,0.0,-0.087323,-0.263945,0.127913,2.0,-0.289474,0.751416,-0.1250,-0.652482,...,0.0,0.0,-1.0,1.0,0.0,0.0,0.0,-1.0,1.0,0.0
1,-0.037037,0.0,-0.273819,-0.548445,0.408869,0.0,-0.421053,0.384728,0.7500,-0.316109,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.000000,0.0,0.326836,0.503103,-0.685456,1.0,-0.184211,0.140130,-0.8750,0.660588,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.592593,0.0,-0.021207,-0.280838,-0.277011,1.0,1.342105,-0.609188,-0.2500,1.134752,...,0.0,0.0,-1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.629630,0.0,-0.137845,-0.391718,0.209703,0.0,-0.342105,-0.585274,0.1875,-1.357649,...,0.0,0.0,-1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2014,0.148148,0.0,0.760019,0.999556,-0.523307,0.0,0.210526,0.046570,-0.6875,-1.096251,...,0.0,0.0,0.0,0.0,0.0,1.0,-1.0,0.0,0.0,0.0
2015,0.962963,0.0,-0.433183,-0.760347,0.617682,1.0,0.868421,0.971680,0.4375,0.498480,...,0.0,0.0,-1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2016,-1.148148,0.0,-1.086231,-0.457428,0.347803,0.0,-0.342105,0.503881,1.5000,0.500507,...,0.0,0.0,-1.0,1.0,0.0,0.0,0.0,-1.0,1.0,0.0
2017,0.074074,0.0,0.038360,0.212617,0.153368,0.0,-0.500000,0.738829,-0.5625,-0.468085,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,1.0,0.0
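
The cell above fits the scaler on the whole DataFrame, which includes the `Customer Status` target and the rows that later become the test set. One common alternative, sketched here on synthetic data, is to split first, fit the scaler on the training features only, and leave the target unscaled. The toy values and variable names below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the pre-processed data (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Monthly Charge": rng.normal(70, 20, size=200),
    "Tenure in Months": rng.integers(1, 72, size=200),
    "Customer Status": rng.integers(0, 2, size=200),
})

X = toy.drop(columns="Customer Status")
y = toy["Customer Status"]  # the target stays as 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Learn the median and IQR from the training rows only, then reuse them.
scaler = RobustScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.shape, X_test_scaled.shape)  # (140, 2) (60, 2)
```

Fitting on the training split keeps information about the test rows out of the scaling parameters, so the test score is a fairer estimate of performance on unseen data.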


### Training the Model

In [14]:
# Separate the target from the independent variables.
target = data_scale['Customer Status']
ivCol = list(data_scale.columns)
ivCol.remove('Customer Status')
independent_variables = data_scale[ivCol]
# 70:30 train/test split with a fixed seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(independent_variables, target, test_size=0.3, random_state=6789, shuffle=True)
logisticRegr = LogisticRegression()
logisticRegr.fit(x_train, y_train)

### Prediction

In [35]:
y_pred = logisticRegr.predict(x_test)
y_pred

accuracy = accuracy_score(y_test,y_pred)
accuracy

0.7788778877887789
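
The exercise asks for performance metrics, and accuracy alone can hide how the errors are distributed between the two classes. As a sketch of what could be printed alongside it, the block below computes a confusion matrix, precision, recall, and F1 with `sklearn.metrics`. The label arrays are hypothetical toy values, not the notebook's `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 is treated as the positive class).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat)

print(cm.tolist())          # [[3, 1], [1, 3]]
print(acc, prec, rec, f1)   # 0.75 0.75 0.75 0.75
```

With the notebook's own variables, the same calls would take `y_test` and `y_pred` as arguments.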