# Working on the Datasets

We will use the pandas library to read the dataset. Its `read_csv` function reads a CSV file and returns the contents as a DataFrame object.

### Required Libraries

- pandas
- matplotlib
- numpy
- scikit-learn


To install the libraries, you can use `pip` or `conda` as follows:

```bash
pip install pandas matplotlib scikit-learn numpy
```

```bash
conda install pandas matplotlib scikit-learn numpy
```


In [26]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

### About the dataset

The dataset is available on [Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn).

The dataset contains 7043 rows and 21 columns.

The dataset contains information about a telecom company that has a problem with the churn rate of its customers. The company wants to know which customers are likely to churn next month. The dataset contains information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about the customers

In [None]:
dataset = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# collapse runs of whitespace inside string cells into a single space
dataset = dataset.replace(r'\s+', ' ', regex=True)

# replace empty or whitespace-only cells with NaN
dataset = dataset.replace(r'^\s*$', np.nan, regex=True)

# drop the id column from the dataset
dataset = dataset.drop('customerID', axis=1)

# How many rows and columns are in this dataset?
print(dataset.shape)

# Store the data type of each column in a Series called data_types
data_types = dataset.dtypes

# Print the data type of each column
print(data_types)

# What are the first two entries in the dataset?
print(dataset.head(2))

# What are the last two entries in the dataset?
print(dataset.tail(2))

# What is the name of each column?
print(dataset.columns)
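
The loading and cleaning steps above can be tried without the CSV file. The following is a minimal sketch, assuming a hypothetical three-row sample (`csv_text`) in place of the real file. The explicit `pd.to_numeric` conversion is an addition for illustration and is not part of the cell above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample that mimics the layout of the real file.
# The second row has a whitespace-only TotalCharges cell.
csv_text = (
    "customerID,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "0001-A,1,29.85,29.85,No\n"
    "0002-B,0,52.55, ,No\n"
    "0003-C,34,56.95,1889.5,Yes\n"
)

sample = pd.read_csv(io.StringIO(csv_text))

# Whitespace-only cells become NaN, as in the cell above.
sample = sample.replace(r'^\s*$', np.nan, regex=True)

# Assumption: TotalCharges is meant to be numeric, so convert it explicitly.
sample['TotalCharges'] = pd.to_numeric(sample['TotalCharges'])

# Drop the identifier column, which carries no predictive information.
sample = sample.drop('customerID', axis=1)

print(sample.shape)
print(sample.dtypes)
print(sample['TotalCharges'].isnull().sum())
```

Because of the blank cell, pandas first reads `TotalCharges` as text. Converting it after the NaN replacement is one way to get a numeric column that a mean-based imputer can work with.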

### Converting the yes and no values to 1 and 0

We first identify the columns that contain only yes and no values and convert them to 1 and 0, respectively. `LabelEncoder` sorts the labels before numbering them, so `No` becomes 0 and `Yes` becomes 1.

**Note:** This cell is dependent on the dataset and the column names. If you are using a different dataset, you will need to change the column names.

In [None]:
dataset = pd.DataFrame(dataset)

# Names of the columns that need to be encoded
column_name = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']

# Create a label encoder object
le = LabelEncoder()

# Apply the label encoder object on the data
dataset[column_name] = dataset[column_name].apply(le.fit_transform)

# Print the first 5 rows of the dataset
print(dataset.head(5))
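
To see which integer each label receives, the encoding can be reproduced on a small frame. This is a minimal sketch, assuming a hypothetical two-column frame `toy`; the explicit `map` call is shown only as an alternative and is not what the cell above uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two Yes/No columns.
toy = pd.DataFrame({
    'Partner': ['Yes', 'No', 'No'],
    'Churn': ['No', 'No', 'Yes'],
})

# LabelEncoder sorts the class labels, so 'No' -> 0 and 'Yes' -> 1.
le = LabelEncoder()
encoded = toy.apply(le.fit_transform)
print(le.classes_)
print(encoded)

# Alternative: an explicit mapping spells out the coding, and any
# unexpected value becomes NaN instead of silently getting its own integer.
mapped = toy.apply(lambda col: col.map({'No': 0, 'Yes': 1}))
print(mapped)
```

Both approaches give the same result on clean Yes/No data. The explicit mapping makes it easier to spot stray values such as a misspelled label.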


### Dividing the dataset into dependent and independent variables

We will divide the dataset into dependent and independent variables.

The dependent variable is the variable that we want to predict. The independent variables are the variables that we will use to predict it. In this case:

- The dependent variable is the `Churn` column.
- The independent variables are all the other columns (the `customerID` column was already dropped when the data was loaded).

In [None]:
# divide the dataset into x and y
dataset_columns_length = len(dataset.columns)
print(dataset_columns_length)

x = dataset.iloc[:, 0:(dataset_columns_length-1)].values
y = dataset.iloc[:, (dataset_columns_length-1)].values

# What is the shape of x and y?
print(x.shape)
print(y.shape)


# What is the value of the 1st row of x?
print(x[0])

# What is the value of the 1st row of y?
print(y[0])
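
The positional split above relies on `Churn` being the last column. The following sketch selects the target by name instead, assuming a hypothetical three-row frame `toy`; the names `X_df` and `y_ser` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy frame: two feature columns and the target column.
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'MonthlyCharges': [29.85, 56.95, 53.85],
    'Churn': [0, 0, 1],
})

# Select the target by name so the split does not depend on column order.
X_df = toy.drop(columns='Churn')
y_ser = toy['Churn']

# Convert to NumPy arrays, matching the x and y used in the cell above.
x = X_df.values
y = y_ser.values

print(x.shape)
print(y.shape)
```

Selecting by name keeps working if columns are later reordered or new ones are appended after `Churn`.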

### Checking for missing values

We will check for missing values in the dataset. If a numeric column has missing values, we will replace them with the mean of that column.

In [None]:
dataset.isnull().sum()   # check for missing values

# Store the indices of the columns with missing values in a list called missing_cols
missing_cols = [i for i in range(dataset_columns_length) if dataset.iloc[:, i].isnull().any()]
print(missing_cols)

# Print columns index and names with missing values and the number of missing values
for i in missing_cols:
    print(i, dataset.columns[i], dataset.iloc[:, i].isnull().sum())
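
The same check can be written so that it reports column names first and positions second. This is a minimal sketch, assuming a hypothetical two-column frame `toy` with a single missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in TotalCharges.
toy = pd.DataFrame({
    'tenure': [1, 0, 34],
    'TotalCharges': [29.85, np.nan, 1889.5],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isnull().sum()
missing_by_name = missing_counts[missing_counts > 0]
print(missing_by_name)

# Look up the positional index of each affected column, which is what
# is needed when indexing into a NumPy array such as x.
missing_idx = [toy.columns.get_loc(name) for name in missing_by_name.index]
print(missing_idx)
```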


In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# impute the missing values with the column mean
# The index 18 is hard-coded: it should match the column index printed by the missing-value check above
# The cast to float is a precaution, because x is an object array
x[:, 18] = imputer.fit_transform(x[:, 18].astype(float).reshape(-1, 1)).ravel()

# inspect one row of x after imputation
print(x[488])
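
The effect of mean imputation is easier to verify on a tiny array. The following is a minimal sketch, assuming a hypothetical object array `x_toy` whose second column is numeric with one gap.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix stored as an object array, like x above:
# column 0 is categorical text, column 1 is numeric with one missing value.
x_toy = np.array([
    ['DSL', 10.0],
    ['Fiber optic', np.nan],
    ['DSL', 30.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Cast the numeric column to float, impute it, and write it back.
col = x_toy[:, 1].astype(float).reshape(-1, 1)
x_toy[:, 1] = imputer.fit_transform(col).ravel()

# The learned column mean and the row that had the gap.
print(imputer.statistics_)
print(x_toy[1])
```

The gap is filled with the mean of the observed values in that column, and the text column is left untouched because only the numeric column is passed to the imputer.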


### Rechecking the data types of each column in x and y
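
One possible way to run this check is sketched below. It assumes hypothetical arrays `x_toy` and `y_toy` in place of `x` and `y`, and it wraps the array in a temporary DataFrame, which is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical arrays standing in for x and y.
x_toy = np.array([['DSL', 1, 29.85], ['No', 34, 56.95]], dtype=object)
y_toy = np.array([0, 1])

# An object array reports a single dtype for the whole array.
print(x_toy.dtype, y_toy.dtype)

# Wrapping it in a DataFrame and calling infer_objects() recovers
# a per-column dtype from the values actually stored.
col_types = pd.DataFrame(x_toy).infer_objects().dtypes
print(col_types)
```

A column that still shows up as a non-numeric type after this check usually needs encoding or conversion before it can be passed to a model.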

