
## TASK 1: DATA PREPARATION ##

# Description:
In this task, you will load the dataset and conduct an initial exploration, handle missing values, and, if necessary, convert categorical variables into numerical representations. Finally, you will split the dataset into training and testing sets for subsequent model evaluation.

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Data Loading and Initial Exploration

* The code starts by loading a customer churn dataset from a CSV file. It then prints the first few rows, general information (such as the data type of each column), basic statistics (such as means and ranges), and the number of customers who churned versus those who did not.

In [2]:
df = pd.read_csv('/content/Telco_Customer_Churn_Dataset  (3).csv')
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [4]:
df.describe()

,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [5]:
df['Churn'].value_counts()

Churn,count
No,5174
Yes,1869
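
The raw counts above can also be read as proportions, which makes the class imbalance easier to see. The following is a minimal sketch that rebuilds a stand-in `Churn` column from the counts shown (5174 "No", 1869 "Yes") rather than loading the original file; the variable names `churn`, `churn_share`, and `churn_rate` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Stand-in Series built from the counts reported above (not the real file)
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="Churn")

# normalize=True returns fractions of the total instead of raw counts
churn_share = churn.value_counts(normalize=True)
churn_rate = churn_share["Yes"]

print(churn_share.round(3))
```

On the real DataFrame the equivalent call would be `df['Churn'].value_counts(normalize=True)`. With these counts, about 26.5% of customers churned.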


# 2. Handling Missing Values:
* The code checks whether any values are missing in the dataset. Specifically, it looks at the "TotalCharges" column, which should be numeric but may contain non-numeric entries such as blank strings. Because such entries are stored as text, the first null check does not count them as missing. The code converts any non-numeric values to missing (NaN) and then fills those missing spots with the median (the middle value) of the "TotalCharges" column. This prevents errors when the model tries to use the data.

In [6]:
print(df.isnull().sum())

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


In [7]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [8]:
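# fillna returns a new Series, so assign it back to keep the filled values
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
df['TotalCharges']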

,TotalCharges
0,29.85
1,1889.50
2,108.15
3,1840.75
4,151.65
...,...
7038,1990.50
7039,7362.90
7040,346.45
7041,306.60


In [9]:
df.isnull().sum()

,0
customerID,0
gender,0
SeniorCitizen,0
Partner,0
Dependents,0
tenure,0
PhoneService,0
MultipleLines,0
InternetService,0
OnlineSecurity,0
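
To make the conversion and fill steps concrete, here is a small sketch on a toy column. The data and the names `toy`, `missing_before`, `missing_after_coerce`, `missing_after_fill`, and `median_value` are made up for illustration; the blank string stands in for the kind of non-numeric entry that "TotalCharges" may contain.

```python
import pandas as pd

# Toy column: numbers stored as text, plus one blank entry (illustrative data)
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1889.5"]})

# A blank string is still a string, so isnull() does not flag it
missing_before = int(toy["TotalCharges"].isnull().sum())

# errors="coerce" turns anything that cannot be parsed into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after_coerce = int(toy["TotalCharges"].isnull().sum())

# median() skips NaN by default; assign the filled Series back to the column
median_value = toy["TotalCharges"].median()
toy["TotalCharges"] = toy["TotalCharges"].fillna(median_value)
missing_after_fill = int(toy["TotalCharges"].isnull().sum())

print(missing_before, missing_after_coerce, missing_after_fill)
```

The counts go from 0 to 1 to 0: the blank only becomes visible as missing after the numeric conversion, and it disappears only if the result of `fillna` is assigned back.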


# 3. Data Preprocessing and Categorical Variables Encoding
* It first identifies which columns contain text (categorical data).
* Then, it removes "customerID", as it is an identifier and not useful for prediction.
* For columns with two possible values (like "Partner", with "Yes"/"No", or "gender", with "Male"/"Female"), it maps "Yes" and "Male" to 1 and "No" and "Female" to 0.
* The "SeniorCitizen" column already contains the values 0 and 1. Passing it through a label encoder leaves those values unchanged and only confirms that it is treated as a numerical category.
* Finally, for the remaining text columns (like "InternetService"), it creates a new column for each category except the first, which is dropped as the baseline (`drop_first=True`). For example, "InternetService" becomes the columns "InternetService_Fiber optic" and "InternetService_No", with true/false values indicating which service the customer has; the dropped "DSL" category is implied when both are false. This is called "one-hot encoding".

In [10]:
# Identify categorical columns (excluding 'customerID'; binary columns such as 'Churn' are filtered out before one-hot encoding)
categorical_cols = df.select_dtypes(include='object').columns.tolist()
categorical_cols.remove('customerID') # Remove ID column

In [11]:

# Drop the ID column
df = df.drop('customerID', axis=1)


In [12]:
binary_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0})

In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['SeniorCitizen'] = le.fit_transform(df['SeniorCitizen'])

In [14]:
df = pd.get_dummies(df, columns=[col for col in categorical_cols if col not in binary_cols], drop_first=True)

df.head()

,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,...,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,0,1,0,1,0,1,29.85,29.85,0,...,False,False,False,False,False,False,False,False,True,False
1,1,0,0,0,34,1,0,56.95,1889.5,0,...,False,False,False,False,False,True,False,False,False,True
2,1,0,0,0,2,1,1,53.85,108.15,1,...,False,False,False,False,False,False,False,False,False,True
3,1,0,0,0,45,0,0,42.3,1840.75,0,...,True,False,False,False,False,True,False,False,False,False
4,0,0,0,0,2,1,1,70.7,151.65,1,...,False,False,False,False,False,False,False,False,True,False
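
The effect of the binary mapping and of `drop_first=True` is easier to see on a tiny example. This is a sketch on made-up rows, not the notebook's data; `toy` and `encoded` are illustrative names.

```python
import pandas as pd

# Three made-up customers covering all three InternetService categories
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No"],
})

# Binary column: map the two labels to 1 and 0
toy["Partner"] = toy["Partner"].map({"Yes": 1, "No": 0})

# One-hot encode the multi-category column; drop_first=True drops the
# first category in sorted order ("DSL"), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["InternetService"], drop_first=True)

print(encoded.columns.tolist())
```

The first row (a DSL customer) has a false value in both remaining "InternetService" columns, which is how the dropped category is represented. Note that `map` returns NaN for any label missing from the dictionary, so it is worth checking for nulls after mapping.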


# 4. Dataset Splitting
* The code divides the data into two sets: a training set and a testing set. The training set is used to teach the machine learning model, and the testing set is used to see how well the model learned.
* It separates the features (the data used to predict, like demographics and service usage) from the target variable (what we're trying to predict, in this case, whether the customer churned or not). The features are assigned to "X" and the target to "y".
* It splits "X" and "y" into training and testing sets, using 80% of the data for training and 20% for testing. "random_state=42" ensures that the split is the same every time you run the code, which is important for reproducibility.
* Finally, it prints the size (shape) of each of the resulting sets.

In [15]:
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
print("X_train shape:" , X_train.shape)
print("X_test shape:" , X_test.shape)
print("y_train shape:" , y_train.shape)
print("y_test shape:" , y_test.shape)

X_train shape: (5634, 30)
X_test shape: (1409, 30)
y_train shape: (5634,)
y_test shape: (1409,)
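
Because churned customers are the minority class (1869 of 7043), one option to consider is a stratified split, which keeps the churn rate roughly equal in the training and testing sets. The original notebook does not do this; the sketch below is an assumption about a possible variant, shown on synthetic data with a hypothetical single `feature` column and a churn rate of 26%.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 26% positive class (illustrative only)
rng = np.random.default_rng(0)
y = pd.Series([0] * 740 + [1] * 260, name="Churn")
X = pd.DataFrame({"feature": rng.normal(size=len(y))})

# stratify=y asks the splitter to preserve the class proportions of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_rate = y_train.mean()
test_rate = y_test.mean()
print(X_train.shape, X_test.shape)
print(round(train_rate, 3), round(test_rate, 3))
```

In the notebook this would amount to adding `stratify=y` to the existing `train_test_split` call. Without it, the split is purely random and the churn rate in the two sets can differ slightly.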
