<p style="text-align:center"> 
    <a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/" target="_blank"> 
    <img src="../assets/logo.png" width="200" alt="Flavio Aguirre Logo"> 
    </a>
</p>

<h1 align="center"><font size="7"><strong>📉 ByeBye Predictor</strong></font></h1>
<br>
<hr>

## 🧠 Introduction

The goal of this project is to develop a machine learning model capable of predicting whether a customer will churn based on various characteristics related to their behavior and opinions expressed on forums or social media.

We will use Kaggle's "Telco Customer Churn" dataset to train the model, since it provides the churn labels needed for a reasonable baseline. We will then simulate a more realistic environment by collecting public comments about phone services from social media platforms such as Reddit. This lets us incorporate sentiment analysis and other features derived from natural language into the predictive model.

<hr>

## 📦 About the dataset
### 1. Structured dataset (customers and churn)
We used the well-known Kaggle public dataset:
* ``Name``: Telco Customer Churn.
  
* ``Source``: [Kaggle - Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn).
  
* ``Description``: Contains information on more than 7,000 customers of a fictitious telecommunications company, with variables such as gender, contract type, contracted services, monthly amount, payment history, etc.
  
* ``Target variable``: Churn (Yes / No)

### 2. Unstructured data (customer comments)
We simulated the collection of real opinions using web scraping from public forums, specifically:

* ``Source``: Reddit (subreddits such as r/movistar).

* ``Tools``: PRAW, Pushshift API, BeautifulSoup.

* ``Content``: Comments on experiences with telephone companies, frequent complaints, provider changes, and levels of satisfaction with the service.

<hr>

## 🎯 Project Objective
This project seeks to predict the likelihood of customer churn by integrating both structured data (age, contract type, service usage) and unstructured data (real user reviews extracted from Reddit).

The approach combines traditional machine learning techniques, natural language processing (NLP), and web scraping to simulate a realistic and challenging environment in the field of Data Science.

<hr>

## 🔄 Project Flow
### 🛠 1. Data Acquisition
* Download the structured dataset from Kaggle.

* Extract comments from Reddit using Python and public APIs.

* Save data in .csv and .json formats.
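
The Reddit extraction and JSON export steps could look roughly like the sketch below. It assumes PRAW is installed and that you have your own Reddit API credentials; the function names (`comment_to_record`, `fetch_comments`, `save_records`) and the choice of fields are illustrative, not part of the project code.

```python
import json
from pathlib import Path
from typing import Any


def comment_to_record(comment: Any) -> dict:
    """Keep only the comment fields we plan to analyze later."""
    return {
        "id": comment.id,
        "subreddit": str(comment.subreddit),
        "body": comment.body,
        "score": comment.score,
        "created_utc": comment.created_utc,
    }


def fetch_comments(subreddit_name: str, limit: int, client_id: str,
                   client_secret: str, user_agent: str) -> list:
    """Fetch the most recent comments of a subreddit (needs network access)."""
    import praw  # third-party; imported lazily so the helpers work without it

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    comments = reddit.subreddit(subreddit_name).comments(limit=limit)
    return [comment_to_record(c) for c in comments]


def save_records(records: list, path: str) -> None:
    """Write the records as UTF-8 JSON, keeping accented characters readable."""
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

A call such as `save_records(fetch_comments("movistar", 100, ...), "comments.json")` would then produce the `.json` file mentioned above; the output file name here is a placeholder.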

### 🧹 2. Text Cleaning and Processing
* Noise removal: URLs, emojis, unnecessary characters

* Comment tokenization and lemmatization

* Sentiment classification using tools such as TextBlob

* Frequent topic extraction using LDA (Latent Dirichlet Allocation)
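
As a rough illustration of the noise-removal and tokenization steps, here is a minimal sketch that uses only the standard library. The regular expressions and the helper names (`clean_comment`, `tokenize`) are assumptions made for this example; lemmatization is left out because it would normally be delegated to NLTK or spaCy.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Common emoji and pictograph ranges (not exhaustive).
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]")
# Anything that is not a letter, digit, whitespace, or basic punctuation.
NOISE_RE = re.compile(r"[^\w\s.,!?'-]")
WHITESPACE_RE = re.compile(r"\s+")
# Runs of Unicode letters, so accented Spanish words stay intact.
WORD_RE = re.compile(r"[^\W\d_]+")


def clean_comment(text: str) -> str:
    """Remove URLs, emojis, and stray symbols, then normalize case and spacing."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip().lower()


def tokenize(text: str) -> list:
    """Split a cleaned comment into alphabetic tokens."""
    return WORD_RE.findall(clean_comment(text))
```

Accented characters are kept on purpose, since subreddits such as r/movistar contain Spanish text.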

### 🧠 3. Feature Engineering
Additional variables are generated from the comments:

* sentiment_score (numerical sentiment value)

* negative_comments (count of negative mentions)

* recurring_topics (technical support, billing, portability, etc.)

* Customer profile simulation to enrich the base dataset:
    * Age, plan type, contact frequency, contract length

    * Synthetic association between comment and customer profile (for experimental purposes)
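
The comment-derived variables above could be computed along these lines. This is a sketch under several assumptions: each comment has already been linked (synthetically) to a hypothetical `customer_id`, sentiment is scored with TextBlob's polarity in the range [-1, 1], and a comment counts as negative below an arbitrary threshold of -0.1. The scorer is passed in as a parameter so another tool can replace TextBlob.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Iterable, Tuple


def textblob_polarity(text: str) -> float:
    """Polarity in [-1.0, 1.0] from TextBlob (third-party, imported lazily)."""
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


def build_comment_features(
    comments: Iterable[Tuple[str, str]],
    scorer: Callable[[str], float] = textblob_polarity,
    negative_threshold: float = -0.1,  # assumed cutoff, tune as needed
) -> dict:
    """Aggregate (customer_id, text) pairs into per-customer features."""
    scores = defaultdict(list)
    for customer_id, text in comments:
        scores[customer_id].append(scorer(text))
    return {
        customer_id: {
            # Average polarity over all comments of this customer.
            "sentiment_score": mean(values),
            # Number of comments scored below the negative threshold.
            "negative_comments": sum(v < negative_threshold for v in values),
        }
        for customer_id, values in scores.items()
    }
```

The resulting dictionary can be turned into a table and joined to the structured dataset on the customer identifier.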

### 🎯 4. Target Labeling (Churn)
The Churn column of the original dataset (Yes / No) is encoded as the target variable (0 = retained, 1 = churned).

The relationship between comment sentiment and churn is analyzed.
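
A minimal sketch of the label encoding and of a first look at sentiment versus churn is shown below. It assumes each row is a dictionary with a `Churn` value of "Yes" or "No" and a `sentiment_score` field produced during feature engineering; the function names are illustrative.

```python
from statistics import mean
from typing import Iterable

CHURN_MAP = {"yes": 1, "no": 0}


def encode_churn(value: str) -> int:
    """Map the dataset's Yes / No labels to 1 (churned) / 0 (retained)."""
    key = value.strip().lower()
    if key not in CHURN_MAP:
        raise ValueError(f"Unexpected Churn value: {value!r}")
    return CHURN_MAP[key]


def mean_sentiment_by_churn(rows: Iterable[dict]) -> dict:
    """Average sentiment_score for retained (0) and churned (1) customers."""
    groups = {0: [], 1: []}
    for row in rows:
        groups[encode_churn(row["Churn"])].append(row["sentiment_score"])
    # An empty group yields None rather than raising an error.
    return {label: (mean(vals) if vals else None) for label, vals in groups.items()}
```

A clearly lower average for the churned group would suggest that the sentiment feature carries signal, although a group mean alone is not evidence of predictive value.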

### 🤖 5. Modeling and Evaluation
* Applied models: Logistic Regression, Random Forest, XGBoost

* Metrics used: Accuracy, Precision, Recall, F1-score

* Model comparison:
    * With structured data only

    * With structured data + sentiment
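
To make the metrics concrete, the sketch below computes them from confusion-matrix counts for a binary churn label (1 = churned). In practice scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` would be used; this plain-Python version only illustrates the definitions, and the function names are made up for the example.

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple:
    """Return (tp, fp, tn, fn), treating 1 (churned) as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Accuracy, precision, recall, and F1, with 0.0 when a ratio is undefined."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Running the same function on the predictions of both model variants (structured data only, and structured data + sentiment) gives a like-for-like comparison. Because churn labels are often imbalanced, recall and F1 tend to be more informative than accuracy alone.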

### 📊 6. Visualization and Presentation
Creation of a dashboard with Streamlit to explore:

* Most frequent opinions and topics

* Churn prediction by profile

* Impact of sentiment on the final decision

The full project documentation is published on GitHub.

<hr>

## ⚙️ Technologies to be used:
* Python

* Pandas, NumPy

* Scikit-learn

* NLTK / spaCy

* TextBlob (for sentiment analysis)

* Matplotlib / Seaborn

* PRAW / BeautifulSoup (for data collection)

* XGBoost

* Streamlit (for the dashboard)