# Working on the Datasets

We will use the pandas library to load the dataset. The `read_csv` function reads a CSV file and returns its contents as a `DataFrame` object.

### Required Libraries

- pandas
- matplotlib
- numpy
- scikit-learn


To install the libraries, you can use `pip` or `conda` as follows:

```bash
pip install pandas matplotlib scikit-learn numpy
```

```bash
conda install pandas matplotlib scikit-learn numpy
```


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

### About the dataset

The dataset is the [Telco Customer Churn dataset](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) from Kaggle.

The dataset contains 7043 rows and 21 columns.

The dataset contains information about a telecom company that has a problem with the churn rate of its customers. The company wants to know which customers are likely to churn next month. The dataset contains information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about the customers

In [None]:
# function to load data
def load_data(file_name):
    # load dataset
    dataset = pd.read_csv(file_name)
    
    # drop the customerID column
    dataset.drop('customerID', axis = 1, inplace = True)
    
    # replace empty or whitespace-only cells with NaN
    dataset.replace(r'^\s*$', np.nan, regex=True, inplace = True)
    
    # convert the TotalCharges column to numeric
    # (it is read as text because some of its cells are blank)
    dataset['TotalCharges'] = pd.to_numeric(dataset['TotalCharges'])
    
    return dataset

dataset = load_data('Telco-Customer-Churn.csv')

# How many rows and columns are in this dataset?
print(dataset.shape)

# Store the data type of each column in a Series called data_types
data_types = dataset.dtypes

# Print the data type of each column
print(data_types)

# What are the first two entries in the dataset?
print(dataset.head(2))

# What are the last two entries in the dataset?
print(dataset.tail(2))

# What is the name of each column?
print(dataset.columns)
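
The blank-cell handling in `load_data` can be tried on its own. The sketch below uses a toy frame with made-up values (not rows from the real file) to show how whitespace-only cells become `NaN`, after which `pd.to_numeric` can parse the column as floats.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking a numeric column that was read as text
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", ""]})

# Empty or whitespace-only cells become NaN
toy = toy.replace(r"^\s*$", np.nan, regex=True)

# The column can now be parsed as floats; the NaN cells stay NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"])

n_missing = int(toy["TotalCharges"].isnull().sum())
print(toy["TotalCharges"].dtype, n_missing)
```

Without the replacement step, `pd.to_numeric` would raise an error on the blank strings, which is why the order of the two steps matters.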

### Checking for missing values

We will check for missing values in the dataset. Missing values in float columns are replaced with the mean of the column, and missing values in text or integer columns are replaced with the most frequent value of the column.
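
As a small illustration of imputation with `SimpleImputer`, the sketch below fills a numeric column with its mean and a text column with its most frequent value. The frame and its column names (`charges`, `plan`) are hypothetical and only serve the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical columns) with one gap in each column
toy = pd.DataFrame({
    "charges": [10.0, np.nan, 30.0],
    "plan": pd.Series(["basic", "basic", np.nan], dtype=object),
})

# Numeric column: fill the gap with the column mean
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy["charges"] = mean_imputer.fit_transform(toy[["charges"]]).ravel()

# Text column: fill the gap with the most frequent value
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy["plan"] = mode_imputer.fit_transform(toy[["plan"]]).ravel()

print(toy)
```

`SimpleImputer` expects a 2-D input, which is why the columns are selected with double brackets and the result is flattened with `ravel()`.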

In [None]:
dataset.isnull().sum()   # check for missing values

In [None]:
# function to find the missing columns
def find_missing_columns(dataset):
    dataset_columns_length = len(dataset.columns)
    # Store the column number of the columns with missing values in a list called missing_cols
    missing_cols = [i for i in range(dataset_columns_length) if dataset.iloc[:, i].isnull().any()]
    
    # Print columns index and names with missing values and the number of missing values
    for i in missing_cols:
        print(i, dataset.columns[i], dataset.iloc[:, i].isnull().sum())
            
    return missing_cols

# Store the column number of the columns with missing values in a list called missing_cols
missing_cols = find_missing_columns(dataset)

print(missing_cols)


In [None]:

# function to fill the missing values in the given columns (by column index)
def impute_missing_values(dataset, missing_columns):
    for column in missing_columns:
        column_name = dataset.columns[column]
        if dataset[column_name].dtype == 'object' or dataset[column_name].dtype == 'int64':
            # text and integer columns: use the most frequent value
            imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
            dataset[column_name] = imputer.fit_transform(dataset[column_name].values.reshape(-1, 1)).ravel()
        elif dataset[column_name].dtype == 'float64':
            # float columns: use the column mean
            imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
            dataset[column_name] = imputer.fit_transform(dataset[column_name].values.reshape(-1, 1)).ravel()
    return dataset

dataset = impute_missing_values(dataset, missing_cols)

missing_cols = find_missing_columns(dataset)
print(missing_cols)


### Converting the yes and no values to 1 and 0

We first list the columns that contain only `Yes` and `No` values and convert them to 1 and 0 respectively.

**Note:** This cell is dependent on the dataset and the column names. If you are using a different dataset, you will need to change the column names.
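
An explicit mapping is one alternative way to do this conversion, shown here as a sketch on a toy frame with made-up rows. The dictionary name `yes_no_map` is an assumption for the example. Any value that is not in the dictionary becomes `NaN`, which makes unexpected entries easy to spot.

```python
import pandas as pd

# Toy frame (hypothetical rows) with two Yes/No columns
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Churn": ["No", "No", "Yes"],
})

# Explicit mapping: the meaning of 0 and 1 is visible in the code
yes_no_map = {"No": 0, "Yes": 1}

for col in ["Partner", "Churn"]:
    toy[col] = toy[col].map(yes_no_map)

print(toy)
```

`LabelEncoder` produces the same codes for these two values because it sorts the class labels, so `No` comes before `Yes`.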

In [None]:
# Names of the columns that need to be encoded as 0s and 1s
column_name = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']

# function to convert the Yes and No values to 1 and 0
def convert_yes_no(dataset, column_name_to_convert):
    # create a label encoder object
    label_encoder = LabelEncoder()
    
    # fit the encoder on each column separately and replace the column with the encoded values
    # (the classes are sorted, so 'No' becomes 0 and 'Yes' becomes 1)
    dataset[column_name_to_convert] = dataset[column_name_to_convert].apply(label_encoder.fit_transform)
    
    return dataset

dataset = convert_yes_no(dataset, column_name)

# Print the first 5 rows of the dataset
# print(dataset.head(5))

print(dataset.dtypes)
print(dataset.columns)


### Converting the categorical values to numerical values

Now we will convert the categorical values to numerical values. For this purpose we will use the `OneHotEncoder` class from the `sklearn.preprocessing` module.
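
To see what one-hot encoding produces, the sketch below uses `pd.get_dummies` on a toy frame with made-up rows. This is a simpler alternative to `OneHotEncoder`, not the method used in the cell that follows, and it is shown only to make the result of the transformation concrete.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one categorical column
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": [0, 0, 1],
})

# Replace the Contract column with one 0/1 column per category
encoded = pd.get_dummies(toy, columns=["Contract"], dtype="int64")

# Move the target column back to the last position
feature_cols = [c for c in encoded.columns if c != "Churn"]
encoded = encoded[feature_cols + ["Churn"]]

print(encoded.columns.tolist())
```

Each category becomes its own column, and every row has a 1 in exactly one of them.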

In [None]:
# Names of the categorical columns
# (the Kaggle CSV names the gender column in lowercase: 'gender')
categorical_col_names = ['gender', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaymentMethod']

# function to convert the categorical values to numerical values using one hot encoding
def convert_categorical(dataset, column_names_to_convert):
    column_index = []
    
    # find the index of the categorical columns
    for column_name in column_names_to_convert:
        column_index.append(dataset.columns.get_loc(column_name))
        
    # convert the dataset to a numpy array
    dataset_array = dataset.values
    
    # one hot encoder object 
    one_hot_encoder = OneHotEncoder(dtype=np.int64, handle_unknown='ignore')

    # apply the one hot encoder object on the independent variable dataset
    encoded_x = one_hot_encoder.fit_transform(dataset_array[:, column_index]).toarray()
    
    # drop the original column from the dataset
    dataset_array = np.delete(dataset_array, column_index, axis = 1)
    
    # add the new columns to the dataset
    dataset_array = np.concatenate((dataset_array, encoded_x), axis = 1)
    
    
    # get the column names of the new columns
    encoded_x_column_names = one_hot_encoder.get_feature_names_out(input_features=column_names_to_convert)
    
    # drop the old column from the dataset
    dataset = dataset.drop(column_names_to_convert, axis = 1)
    
    # record the data types of each column
    original_data_types = dataset.dtypes.to_dict()
    # all the data types are int64 for encoded columns
    
    
    # record the last column number of the dataset
    last_column_number = len(dataset.columns)
    
    # reconstruct the new dataset column names
    new_column_names = list(dataset.columns[0:last_column_number-1]) + list(encoded_x_column_names)
    new_column_names.append(dataset.columns[last_column_number-1])
    
    # rearrange the columns of the dataset_array 
    # i.e. bring the dataset_array column with the last column number to the last column number of the new dataset_array
    dataset_array = np.concatenate((dataset_array[:, 0:last_column_number-1], dataset_array[:, last_column_number:], dataset_array[:, last_column_number-1:last_column_number]), axis = 1)
    
    # convert the dataset to a dataframe
    # Here the column names are the original column names and the one hot encoded column names
    # and the values are the values of the dataset array
    dataset = pd.DataFrame(data=dataset_array, columns = new_column_names)
    
    # restore the original data types of the columns
    for column_name in dataset.columns:
        if column_name in original_data_types:
            dataset[column_name] = dataset[column_name].astype(original_data_types[column_name])
        else:
            dataset[column_name] = dataset[column_name].astype('int64')
    
    return dataset

dataset = convert_categorical(dataset, categorical_col_names)

print(dataset.dtypes)
# print(dataset.columns)

### Dividing the dataset into dependent and independent variables

We will divide the dataset into dependent and independent variables.

The dependent variable is the variable that we want to predict. The independent variables are the variables that we use to predict it. In this case:

- The dependent variable is the `Churn` column.
- The independent variables are all the other columns. The `customerID` column was already dropped when the data was loaded.
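
The same split can also be written by column name instead of by position. The sketch below assumes a toy frame with made-up rows. Selecting by name does not depend on `Churn` being the last column.

```python
import pandas as pd

# Toy frame (hypothetical rows); Churn is the target
toy = pd.DataFrame({
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "Churn": [0, 1],
})

# Independent variables: every column except the target
x = toy.drop(columns="Churn").values

# Dependent variable: the target column
y = toy["Churn"].values

print(x.shape, y.shape)
```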

In [None]:
# function to divide the dataset into x and y
def divide_dataset(dataset):
    # divide the dataset into x and y
    dataset_columns_length = len(dataset.columns)
    print(dataset_columns_length)

    x = dataset.iloc[:, 0:(dataset_columns_length-1)].values
    y = dataset.iloc[:, (dataset_columns_length-1)].values
    
    return x, y

x, y = divide_dataset(dataset)
# What is the shape of x and y?
print(x.shape)
print(y.shape)


# What is the value of the 1st row of x?
print(x[0])

# What is the value of the 1st row of y?
print(y[0])