# Logistic Regression Consulting Project


This project covers the process of building a logistic regression model on customer churn data and using it to predict the likelihood that new customers will churn.

## 1. Data Preparation and Feature Engineering
Data Reading: The customer_churn.csv file, which contains the customer churn data, is read into a PySpark DataFrame with the inferSchema=True and header=True parameters.

Data Inspection: The describe() method is used to inspect basic statistics for each column in the dataset. The count is the same for every column, which confirms that there is no missing data.

VectorAssembler: A VectorAssembler is used to create the feature vector that the machine learning model requires. Numerical columns such as Age, Total_Purchase, Account_Manager, Years, and Num_Sites are combined into a single features column. String columns like Name are left out at this stage: VectorAssembler accepts only numeric, boolean, and vector inputs, and logistic regression needs numeric features.

Final DataFrame: A final DataFrame is created containing only the features and churn columns.
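
The preparation steps above can be sketched roughly as follows. This is a minimal sketch, not the project's actual notebook code: the helper names (`load_churn_data`, `make_assembler`, `build_final_data`) are hypothetical, and the column names are taken from this README, so they should be checked against the real CSV header.

```python
# Column names follow this README (assumption); verify them against the CSV header.
FEATURE_COLS = ["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites"]
LABEL_COL = "churn"


def load_churn_data(spark, path="customer_churn.csv"):
    """Read the CSV with a header row and inferred column types."""
    return spark.read.csv(path, inferSchema=True, header=True)


def make_assembler():
    """Build the VectorAssembler that packs the numeric columns into `features`."""
    # Imported inside the function so the helpers can be defined before Spark is loaded.
    from pyspark.ml.feature import VectorAssembler

    return VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")


def build_final_data(df, assembler):
    """Return a DataFrame holding only the feature vector and the label column."""
    return assembler.transform(df).select("features", LABEL_COL)
```

Given an existing SparkSession named `spark`, `df = load_churn_data(spark)` followed by `df.describe().show()` prints the summary statistics, and `build_final_data(df, make_assembler())` produces the final DataFrame. Keeping the assembler as a separate object makes it easy to reuse later on the new customer data.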

## 2. Model Training and Evaluation
Data Splitting: The final DataFrame is split with randomSplit into a 70% training set and a 30% test set, so that the model can be evaluated on data it was not trained on.

Logistic Regression Model: A model object is created using the LogisticRegression class from the pyspark.ml.classification library. The churn column is designated as the target variable (labelCol).

Model Training: The model is trained on the training data using the fit() method. After training, it is evaluated on the test data to assess its performance.
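
A rough sketch of the split and training steps is shown below. The function name `split_and_fit` and the fixed `seed` are assumptions added for illustration (the README does not say whether a seed was used); `final_data` is expected to hold the features and churn columns described above.

```python
SPLIT_WEIGHTS = [0.7, 0.3]  # 70% training, 30% test, as described in this README


def split_and_fit(final_data, label_col="churn", seed=42):
    """Split the data, fit logistic regression on the training part, score the test part."""
    # Imported inside the function so the helper can be defined before Spark is loaded.
    from pyspark.ml.classification import LogisticRegression

    # randomSplit samples rows at random, so the resulting sizes are only approximately 70/30.
    train_data, test_data = final_data.randomSplit(SPLIT_WEIGHTS, seed=seed)

    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    fitted_model = lr.fit(train_data)

    # transform() appends rawPrediction, probability, and prediction columns.
    test_predictions = fitted_model.transform(test_data)
    return fitted_model, test_predictions
```

Passing a seed only makes the split repeatable between runs; it can be dropped if a fresh split is wanted each time.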

Model Evaluation: A BinaryClassificationEvaluator from the pyspark.ml.evaluation library is used. This tool evaluates the prediction results on the test data and, by default, calculates the Area Under the ROC Curve (AUC). In the example, an AUC value of 0.68 was obtained, which indicates that the model performs better than random guessing (AUC of 0.5) but falls well short of a perfect classifier (AUC of 1.0).
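
To make the evaluation step and the meaning of the AUC more concrete, here is a small sketch. `evaluate_auc` wraps the evaluator call and is a hypothetical helper name. `pairwise_auc` is a plain-Python illustration added here, not part of the project: it computes the AUC as the fraction of (churned, not churned) pairs in which the churned customer receives the higher score. Spark computes the area from the ROC curve itself, so its value may differ slightly from this pairwise estimate.

```python
def evaluate_auc(test_predictions, label_col="churn"):
    """Score test-set predictions with Spark's evaluator (default metric: areaUnderROC)."""
    # Imported inside the function so the helper can be defined before Spark is loaded.
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    evaluator = BinaryClassificationEvaluator(labelCol=label_col)
    return evaluator.evaluate(test_predictions)


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data (made up for illustration): three of the four pairs are ranked correctly.
toy_auc = pairwise_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

A model that gives every customer the same score gets 0.5 under this definition, which is why 0.5 is the "random guess" baseline.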

## 3. Predicting on New Data
Using All Data: The churn model is fit to all available data (final_data) to create a more robust model for future predictions.

Reading New Customer Data: The new_customers.csv file is read with the same parameters to load data for new customers.

Data Transformation: The same assembler object used for the original data is reused to transform the new customer data into the same features vector format, ensuring consistency.

Making Predictions: The trained final model (final_lr_model) is used with the transform() method on the new customer data to make a churn prediction for each new customer. The results show which customers have the potential to churn.
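
The prediction workflow in this section might look roughly like the sketch below. The function name `predict_churn` is hypothetical, `new_customers` is assumed to be the DataFrame read from new_customers.csv, and `assembler` is assumed to be the same VectorAssembler that was used on the original data. The Company column is taken from the field list in this README.

```python
CHURN_CLASS = 1.0  # Spark stores the predicted class as a double


def predict_churn(final_data, new_customers, assembler, label_col="churn"):
    """Fit on all labeled data, then flag the new customers predicted to churn."""
    # Imported inside the function so the helper can be defined before Spark is loaded.
    from pyspark.ml.classification import LogisticRegression

    # Refit on every labeled row, not just the training split.
    final_lr_model = LogisticRegression(
        featuresCol="features", labelCol=label_col
    ).fit(final_data)

    # Reuse the original assembler so the feature vector has the same layout.
    test_new_customers = assembler.transform(new_customers)
    results = final_lr_model.transform(test_new_customers)

    # Keep only the customers whose predicted class is "churn".
    return results.filter(results["prediction"] == CHURN_CLASS).select(
        "Company", "prediction"
    )
```

Calling `.show()` on the returned DataFrame lists the companies the model flags as likely to churn.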

Results: In the example, four of the new customers were predicted to churn. This provides a concrete example of how the model can assist with real-world business decisions.

This project demonstrates how to manage the end-to-end machine learning workflow in PySpark, from data preparation and model evaluation to making predictions on new data.

## Binary Customer Churn

A marketing agency has many customers that use its service to produce ads for their websites. The agency has noticed quite a bit of churn among its clients. Account managers are currently assigned more or less at random, but the agency wants you to create a machine learning model that predicts which customers will churn (stop buying the service), so that it can assign an account manager to the customers most at risk of churning. Luckily, the agency has some historical data. Can you help them out? Create a classification algorithm that classifies whether or not a customer churned. The company can then apply it to incoming data for future customers to predict which of them will churn and assign them an account manager.

The data is saved as customer_churn.csv. Here are the fields and their definitions:

    Name: Name of the latest contact at the company
    Age: Customer age
    Total_Purchase: Total ads purchased
    Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
    Years: Total years as a customer
    Num_sites: Number of websites that use the service
    Onboard_date: Date on which the latest contact was onboarded
    Location: Client HQ address
    Company: Name of the client company
    
Once you've created the model and evaluated it, test out the model on some new data (you can think of this almost like a hold-out set) that your client has provided, saved under new_customers.csv. The client wants to know which customers are most likely to churn given this data (they don't have the label yet).