![Cartoon of telecom customers](IMG_8811.png)


The telecommunications (telecom) sector in India is changing rapidly, with new telecom businesses entering the market and many customers switching between providers. "Churn" refers to customers or subscribers ceasing to use a company's services or products. Understanding the factors that influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve their service quality and customer satisfaction. As the data scientist on this project, you will explore how customer behavior and demographics in the Indian telecom sector relate to churn, using two datasets from four major telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL):

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                      |
|----------------------|--------------------------------------------------|
| `customer_id`          | Unique identifier for each customer.             |
| `telecom_partner`      | The telecom partner associated with the customer.|
| `gender`               | The gender of the customer.                      |
| `age`                  | The age of the customer.                         |
| `state`                | The Indian state in which the customer is located.|
| `city`                 | The city in which the customer is located.       |
| `pincode`              | The pincode of the customer's location.          |
| `registration_event` | When the customer registered with the telecom partner.|
| `num_dependents`      | The number of dependents (e.g., children) the customer has.|
| `estimated_salary`     | The customer's estimated salary.                 |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


# Project Instructions

Does Logistic Regression or Random Forest produce a higher accuracy score in predicting telecom churn in India?

- Load the two CSV files into separate DataFrames and merge them into a DataFrame named `churn_df`. Calculate and print the churn rate, and identify the categorical variables in `churn_df`.
- One-hot encode the categorical features in `churn_df`. Separate the features from the target variable, scale the features, and store the result as `features_scaled`. These scaled features and the target variable are the inputs to the churn prediction model.
- Split the processed data into training and testing sets named `X_train`, `X_test`, `y_train`, and `y_test`, using an 80-20 split and a random state of 42 for reproducibility.
- Train Logistic Regression and Random Forest Classifier models, each with a random seed of 42. Store the model predictions in `logreg_pred` and `rf_pred`.
- Assess the models on the test data. Assign the name of the model with the higher accuracy (`"LogisticRegression"` or `"RandomForest"`) to `higher_accuracy`.
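
The last two steps above can be sketched as follows. This is a minimal illustration rather than the project solution: the data comes from scikit-learn's `make_classification` (an assumption, standing in for the real scaled features), `max_iter=1000` is an arbitrary choice to help the solver converge, and ties are resolved in favour of the random forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 80% class 0, 20% class 1).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic regression with a fixed seed.
logreg = LogisticRegression(random_state=42, max_iter=1000)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

# Random forest with the same seed.
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

# Compare test-set accuracy; a tie falls through to "RandomForest".
logreg_acc = accuracy_score(y_test, logreg_pred)
rf_acc = accuracy_score(y_test, rf_pred)
higher_accuracy = "LogisticRegression" if logreg_acc > rf_acc else "RandomForest"
print(logreg_acc, rf_acc, higher_accuracy)
```

On the real data, `X` and `y` would be replaced by the scaled features and the `churn` column.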

## Resources 
Check resources that can help you solve the problem.

### LESSONS

- [Inner join](https://campus.datacamp.com/courses/joining-data-with-pandas/data-merging-basics?ex=1)
- [Random Forests (RF)](https://campus.datacamp.com/courses/machine-learning-with-tree-based-models-in-python/bagging-and-random-forests?ex=7)
- [Logistic regression](https://campus.datacamp.com/courses/supervised-learning-with-scikit-learn/fine-tuning-your-model-3?ex=4)

### SLIDES

[Classification report in scikit-learn](https://campus.datacamp.com/pdf/web/viewer.html?file=https://projector-video-pdf-converter.datacamp.com/28314/chapter3.pdf#page=18)

### CHEATSHEETS

[Scikit-Learn Cheat Sheet: Python Machine Learning](Scikit-Learn_Cheat_Sheet.pdf)


In [12]:
# Import libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [13]:
# Load the two CSV files into separate DataFrames.
demographics_df = pd.read_csv("telecom_demographics.csv")
demographics_df.head()  

Unnamed: 0,customer_id,telecom_partner,gender,age,state,city,pincode,registration_event,num_dependents,estimated_salary
0,15169,Airtel,F,26,Himachal Pradesh,Delhi,667173,2020-03-16,4,85979
1,149207,Airtel,F,74,Uttarakhand,Hyderabad,313997,2022-01-16,0,69445
2,148119,Airtel,F,54,Jharkhand,Chennai,549925,2022-01-11,2,75949
3,187288,Reliance Jio,M,29,Bihar,Hyderabad,230636,2022-07-26,3,34272
4,14016,Vodafone,M,45,Nagaland,Bangalore,188036,2020-03-11,4,34157


In [14]:
usage_df = pd.read_csv("telecom_usage.csv")
usage_df.head()

Unnamed: 0,customer_id,calls_made,sms_sent,data_used,churn
0,15169,75,21,4532,1
1,149207,35,38,723,1
2,148119,70,47,4688,1
3,187288,95,32,10241,1
4,14016,66,23,5246,1


In [15]:
# Merge them into a DataFrame named churn_df.
churn_df = demographics_df.merge(usage_df, on="customer_id")
churn_df.head()

Unnamed: 0,customer_id,telecom_partner,gender,age,state,city,pincode,registration_event,num_dependents,estimated_salary,calls_made,sms_sent,data_used,churn
0,15169,Airtel,F,26,Himachal Pradesh,Delhi,667173,2020-03-16,4,85979,75,21,4532,1
1,149207,Airtel,F,74,Uttarakhand,Hyderabad,313997,2022-01-16,0,69445,35,38,723,1
2,148119,Airtel,F,54,Jharkhand,Chennai,549925,2022-01-11,2,75949,70,47,4688,1
3,187288,Reliance Jio,M,29,Bihar,Hyderabad,230636,2022-07-26,3,34272,95,32,10241,1
4,14016,Vodafone,M,45,Nagaland,Bangalore,188036,2020-03-11,4,34157,66,23,5246,1
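
`DataFrame.merge` defaults to an inner join, so any `customer_id` present in only one table is silently dropped. A small sketch of how that could be checked, using made-up toy tables rather than the project files:

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented for illustration).
demo = pd.DataFrame({"customer_id": [1, 2, 3], "gender": ["F", "M", "F"]})
usage = pd.DataFrame({"customer_id": [2, 3, 4], "churn": [0, 1, 0]})

# Inner join: only ids found in both tables survive.
# validate="one_to_one" raises if customer_id is duplicated on either side.
inner = demo.merge(usage, on="customer_id", how="inner", validate="one_to_one")

# Outer join with indicator=True adds a "_merge" column showing
# whether each row came from the left table, the right table, or both.
audit = demo.merge(usage, on="customer_id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(inner)
print(unmatched)
```

If `unmatched` is empty on the real data, the inner join loses no customers.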


In [16]:
# Calculate and print churn rate
churn_rate = churn_df["churn"].sum() / churn_df["churn"].count() 
print(churn_rate)

0.20046153846153847
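
Because `churn` is coded as 0/1, the churn rate is simply the mean of the column. With a churn rate near 0.20, a classifier that always predicts 0 would already reach about 80% accuracy on such data, which is a useful baseline when reading the model scores later. A toy illustration (the ten values below are invented):

```python
import pandas as pd

# Ten made-up labels with a 20% churn rate.
churn = pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0, 0], name="churn")

rate_manual = churn.sum() / churn.count()  # churned / total
rate_mean = churn.mean()                   # same value for a 0/1 column

# Share of each class; the largest share is the accuracy of a
# model that always predicts the majority class.
shares = churn.value_counts(normalize=True)
baseline_accuracy = shares.max()

print(rate_mean, baseline_accuracy)
```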


In [19]:
# Identify the categorical variables in churn_df.
# Chosen by inspecting churn_df.columns: the string-valued columns,
# excluding the date column registration_event.
categorical_variables = [
    "telecom_partner",
    "gender",
    "state",
    "city",
]
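
The same list could also be derived programmatically instead of by hand. The sketch below uses a toy frame with a few of the project's column names (the rows are illustrative) and selects every non-numeric column; note that a date stored as text, such as `registration_event`, is picked up too and has to be excluded or parsed separately.

```python
import pandas as pd

# Toy frame mimicking a few churn_df columns.
toy = pd.DataFrame({
    "telecom_partner": ["Airtel", "BSNL"],
    "gender": ["F", "M"],
    "registration_event": ["2020-03-16", "2022-01-16"],
    "age": [26, 74],
})

# Every column that is not numeric is a candidate categorical variable.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Dates stored as strings are not categories; drop them from the list.
date_columns = ["registration_event"]
categorical_candidates = [c for c in non_numeric if c not in date_columns]

# Parsing the date makes parts of it (here the year) usable as numbers.
registration_year = pd.to_datetime(toy["registration_event"]).dt.year

print(categorical_candidates)
```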

In [22]:
# One-hot encode the categorical features in churn_df and store the result in features_scaled.
# (OneHotEncoder is already imported in the first code cell.)

# Create an instance of OneHotEncoder
encoder = OneHotEncoder(handle_unknown="ignore")

# Fit the encoder to the categorical columns
encoded = encoder.fit_transform(churn_df[categorical_variables])

# Convert the encoded array to a dataframe
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())

# Join the encoded dataframe with the original dataframe
features_scaled = churn_df.join(encoded_df)

# Drop the original categorical columns
features_scaled.drop(categorical_variables, axis=1, inplace=True)
features_scaled.head()


Unnamed: 0,customer_id,age,pincode,registration_event,num_dependents,estimated_salary,calls_made,sms_sent,data_used,churn,...,state_Tripura,state_Uttar Pradesh,state_Uttarakhand,state_West Bengal,city_Bangalore,city_Chennai,city_Delhi,city_Hyderabad,city_Kolkata,city_Mumbai
0,15169,26,667173,2020-03-16,4,85979,75,21,4532,1,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,149207,74,313997,2022-01-16,0,69445,35,38,723,1,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,148119,54,549925,2022-01-11,2,75949,70,47,4688,1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,187288,29,230636,2022-07-26,3,34272,95,32,10241,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,14016,45,188036,2020-03-11,4,34157,66,23,5246,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [23]:
# Check column names
for col in features_scaled.columns:
    print(col)

customer_id
age
pincode
registration_event
num_dependents
estimated_salary
calls_made
sms_sent
data_used
churn
telecom_partner_Airtel
telecom_partner_BSNL
telecom_partner_Reliance Jio
telecom_partner_Vodafone
gender_F
gender_M
state_Andhra Pradesh
state_Arunachal Pradesh
state_Assam
state_Bihar
state_Chhattisgarh
state_Goa
state_Gujarat
state_Haryana
state_Himachal Pradesh
state_Jharkhand
state_Karnataka
state_Kerala
state_Madhya Pradesh
state_Maharashtra
state_Manipur
state_Meghalaya
state_Mizoram
state_Nagaland
state_Odisha
state_Punjab
state_Rajasthan
state_Sikkim
state_Tamil Nadu
state_Telangana
state_Tripura
state_Uttar Pradesh
state_Uttarakhand
state_West Bengal
city_Bangalore
city_Chennai
city_Delhi
city_Hyderabad
city_Kolkata
city_Mumbai


In [None]:
# Separate the appropriate features from the target and scale them.
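
One way this unfinished step might look is sketched below on a small synthetic frame (the rows are randomly generated; only the column names echo the project data). It assumes that `customer_id` and `churn` are removed from the feature matrix. A text column such as `registration_event` would also have to be dropped or converted first, because `StandardScaler` only accepts numeric input.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded DataFrame (100 random rows).
rng = np.random.default_rng(42)
n = 100
toy = pd.DataFrame({
    "customer_id": np.arange(n),
    "age": rng.integers(18, 75, n),
    "estimated_salary": rng.integers(20_000, 150_000, n),
    "calls_made": rng.integers(0, 100, n),
    "gender_F": rng.integers(0, 2, n).astype(float),
    "churn": rng.integers(0, 2, n),
})

# Separate the target from the features; the identifier carries no signal.
target = toy["churn"]
features = toy.drop(columns=["customer_id", "churn"])

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# 80-20 split with a fixed seed, as the instructions ask.
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, target, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

This follows the order given in the instructions (scale, then split). Fitting the scaler on the training rows only and then transforming the test rows is a common alternative that keeps test-set statistics out of preprocessing.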
