# Churn at Fit.ly Tech

## Project Context

__Problem Statement:__<br>
You are a data analyst who has just joined Fit.ly Tech, a subscription-based fitness app, and you have received an email from your manager with a new task.
<br>

__Context on the problem:__ <br>
- Over the last 2 quarters they have noticed churn growing across their customer base.
- Retaining customers is critical because customer acquisition cost is rising, and every customer who leaves us puts more pressure on marketing and product.
- __We need a clear picture of what is driving churn and some practical actions we can take for next quarter.__
- We are pulling a dataset from different sources (user activity, customer support and account information). The data is messy because it comes from different teams that use different conventions.
- Use this data as the starting point for a churn analysis, and __make sure to include engagement, support activity and plan type__, since these are things the business cares about. Focus on identifying patterns and potential drivers of churn, plus any other KPIs you think our leaders will pay attention to.
- __Do the analysis and write a short report for the manager.__ He does not need to see the code, but he wants to see your thinking: how you handled data cleaning and interpretation, and how you reached your conclusions.
- Prepare and deliver the presentation to senior leadership. Remember that they are not data specialists and will focus on what they can do to identify and tackle the churn problem.
- The boss is on vacation; if you need to make decisions, include them in your work and he will review them when he is back.

__Context on the data:__ <br>
What the Lead Engineer said: <br>
- Each dataset comes from a different source. Historically these were built independently by each team, so the lead engineer does not know whether they follow any standardization or validation practices. __He would not be surprised to hear about missing or non-standardized data, since the systems go down from time to time.__
- They do no cleaning, transformation or anonymization as part of the pipelines. The only thing they do in ETL is calculate ticket resolution time.
- Each dataset is updated daily.
- Missing or irregular data can occur if any data-loading system experiences downtime. They do not automatically backfill gaps unless there is a major incident.
- All location fields are based on the user's billing information.

Guide to Analysis Projects
1. I would like you to create a written report to summarize the analysis you have
performed and your findings. The report will be read by me (Head of Analysis). The list
below describes what I expect to see in your written report.
2. You will need to use a DataLab workbook to write up your findings and share
visualizations.
3. You must use the data provided for the analysis.
4. You will also need to prepare and deliver a presentation. You should prepare around
8-10 slides to present to senior leadership. The list below describes what they expect to
see in your presentation.
5. Your presentation should be no longer than 10 minutes.

Written Report
Your written report should include written text summaries and graphics of the following:
- Data validation:
    - Describe validation and cleaning steps for every column in the data
- Exploratory Analysis to answer the customer questions ensuring you include:
   - Two different types of graphic showing single variables only
   - At least one graphic showing two or more variables
   - Description of your findings
- Definition of a metric for the business to monitor
    - How should the business monitor what they want to achieve?
    - Estimate the initial value(s) for the metric based on the current data
- Final summary including recommendations that the business should undertake

Presentation <br>
You will give an overview presentation to senior leadership. The presentation should include:
- An overview of the project and business goals
- A summary of the work you undertook and how this addresses the problem
- Your key findings including the metric to monitor and current estimation
- Your recommendations to the business


## Setup

Keep your imports and global settings at the very top.

In [1]:
#Import libraries
from utils import config
import pandas as pd
import numpy as np

# Display options to see all columns
pd.set_option('display.max_columns', None)

# Set a random seed for reproducible results
np.random.seed(42)

## Data Loading

Load all three datasets into distinct dataframes with clear names. Avoid names like df1, df2; use semantic names like df_users, df_orders, df_products.

In [None]:
## 2. Data Loading
try:
    df_users = pd.read_csv('data/raw/users_v1.csv')
    df_orders = pd.read_csv('data/raw/orders_v1.csv')
    df_products = pd.read_csv('data/raw/products_v1.csv')
    print("All datasets loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading files: {e}")

### Account Info

Contains one row per customer account, with its state, plan, plan price and churn status.

__Columns:__
- ``signup_date``: The day the customer created their account
- ``customer_id``: Unique ID per customer
- ``email``: Customer's email
- ``state``: The state where the customer opened their account. The company only operates in the US for now
- ``plan``: The plan tier the customer subscribed to (free, basic, pro or enterprise)
- ``plan_list_price``: The price the customer paid for their subscription
- ``churn_status``: The company marks people as churned once their subscription has been cancelled

In [39]:
df_account_info = pd.read_csv(config.ACCOUNT_INFO_PATH)
df_account_info.head()

Unnamed: 0,customer_id,email,state,plan,plan_list_price,churn_status
0,C10000,user10000@example.com,New Jersey,Enterprise,105,Y
1,C10001,user10001@example.net,Louisiana,Basic,22,Y
2,C10002,user10002@example.net,Oklahoma,Basic,24,
3,C10003,user10003@example.com,Michigan,Free,0,
4,C10004,user10004@example.com,Texas,Enterprise,119,


### Customer Support

Contains information about customer interactions with the support service.

__Columns:__
- ``ticket_time``: Time when the interaction started (Pacific time)
- ``user_id``: A unique ID per user
- ``channel``: The channel the ticket was first received in
- ``topic``: The topic that the ticket addresses
- ``resolution_time_hours``: Hours from ticket creation to ticket resolution
- ``state``: Whether the problem was solved or not (meaning still to be confirmed)
- ``comments``: Comments from the client about the interaction

In [None]:
df_customer_support = pd.read_csv(config.CUSTOMER_SUPPORT_PATH)
df_customer_support

Unnamed: 0,ticket_time,user_id,channel,topic,resolution_time_hours,state,comments
0,2025-06-13 05:55:17.154573,10125,chat,technical,11.48,1,
1,2025-08-06 13:21:54.539551,10109,chat,account,1.01,0,
2,2025-08-22 12:39:35.718663,10149,chat,technical,10.09,0,Erase my data from your systems.
3,2025-06-07 02:49:46.986055,10268,phone,account,9.10,1,
4,2025-07-25 00:24:38.945079,10041,phone,other,2.28,1,
...,...,...,...,...,...,...,...
913,2025-06-05 23:09:46.282238,10225,chat,other,32.46,0,
914,2025-08-19 11:03:30.765219,10081,chat,account,4.98,0,
915,2025-08-07 13:17:13.090150,10373,chat,other,14.65,0,
916,2025-07-06 11:28:27.421494,10148,phone,billing,18.35,1,


### User Activity 

Logs every user action in the app.

__Columns:__
- ``event_time``: Time when the client interacted with the app (Pacific time)
- ``user_id``: A unique ID per user
- ``event_type``: The specific action of the client in the app

In [40]:
df_user_activity = pd.read_csv(config.USER_ACTIVITY_PATH)
df_user_activity

Unnamed: 0,event_time,user_id,event_type
0,2025-09-08 15:05:39.422721,10118,watch_video
1,2025-09-08 08:15:05.264103,10220,watch_video
2,2025-11-14 06:28:35.207671,10009,share_workout
3,2025-08-20 16:53:38.682901,10227,read_article
4,2025-07-24 16:47:31.728422,10123,track_workout
...,...,...,...
440,2025-10-22 12:09:09.477660,10395,read_article
441,2025-09-22 18:28:07.618295,10264,watch_video
442,2025-07-28 14:52:16.311955,10241,track_workout
443,2025-11-27 13:38:32.371575,10340,track_workout
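
Both ``ticket_time`` and ``event_time`` are documented as Pacific time, but `read_csv` loads them as plain strings. A minimal sketch of how they could be parsed and tagged with a timezone, using two rows copied from the preview above. The `events` frame is a toy stand-in, and mapping "Pacific time" to the `America/Los_Angeles` zone is my assumption:

```python
import pandas as pd

# Toy rows copied from the user activity preview; use df_user_activity on the real data.
events = pd.DataFrame({
    "event_time": ["2025-09-08 15:05:39.422721", "2025-11-14 06:28:35.207671"],
    "user_id": [10118, 10009],
})

# Parse strings to datetimes; unparseable values become NaT instead of raising.
events["event_time"] = pd.to_datetime(events["event_time"], errors="coerce")

# Assumption: "Pacific time" means the America/Los_Angeles zone.
# Ambiguous or nonexistent wall-clock times around DST changes become NaT
# so they can be reviewed instead of silently shifted.
events["event_time"] = events["event_time"].dt.tz_localize(
    "America/Los_Angeles", ambiguous="NaT", nonexistent="NaT"
)

print(events.dtypes)
```

Counting the NaT values after this step gives a quick measure of how many timestamps need manual review.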


## Initial Inspection
Create a summary of all three datasets before touching them. This establishes a baseline.

Pro Tip: Write a small helper function (a quick "dirty check") to avoid repeating the same code three times.

In [60]:
def inspect_data(df, name):
    print(f"--- Inspection: {name} ---")
    print(f"Shape: {df.shape}")
    print(f"Missing Values: \n{df.isnull().sum()[df.isnull().sum() > 0]}")
    print(f"Duplicates: {df.duplicated().sum()}")
    #print(f"Info: \n{df.dtypes}")
    display(df.head(3))
    #display(df.describe())
    print("\n")

inspect_data(df_account_info, "Account Info")
inspect_data(df_customer_support, "Customer Support")
inspect_data(df_user_activity, "User Activity")

--- Inspection: Account Info ---
Shape: (400, 6)
Missing Values: 
churn_status    286
dtype: int64
Duplicates: 0


Unnamed: 0,customer_id,email,state,plan,plan_list_price,churn_status
0,C10000,user10000@example.com,New Jersey,Enterprise,105,Y
1,C10001,user10001@example.net,Louisiana,Basic,22,Y
2,C10002,user10002@example.net,Oklahoma,Basic,24,




--- Inspection: Customer Support ---
Shape: (918, 7)
Missing Values: 
comments    872
dtype: int64
Duplicates: 0


Unnamed: 0,ticket_time,user_id,channel,topic,resolution_time_hours,state,comments
0,2025-06-13 05:55:17.154573,10125,chat,technical,11.48,1,
1,2025-08-06 13:21:54.539551,10109,chat,account,1.01,0,
2,2025-08-22 12:39:35.718663,10149,chat,technical,10.09,0,Erase my data from your systems.




--- Inspection: User Activity ---
Shape: (445, 3)
Missing Values: 
Series([], dtype: int64)
Duplicates: 0


Unnamed: 0,event_time,user_id,event_type
0,2025-09-08 15:05:39.422721,10118,watch_video
1,2025-09-08 08:15:05.264103,10220,watch_video
2,2025-11-14 06:28:35.207671,10009,share_workout






Let's check column cardinality for the account info data: <br>
- ``customer_id`` and ``email`` are unique, so every row represents a unique customer.
- ``state`` has 50 distinct values, matching the number of US states, so it looks correct.
- ``plan`` has 4 distinct values, ['Enterprise', 'Basic', 'Free', 'Pro'], which looks correct.
- ``plan_list_price`` has 106 distinct values, which is unexpected since we only have 4 plans. My hypothesis is that plan prices have changed over time.
- ``churn_status`` is only populated (e.g. "Y") for customers who have already churned, so we need to replace the NaN values with "N" to represent customers who have not churned. Note that ``nunique()`` ignores NaN and still reports 2 distinct values, so the churn labels themselves may also need standardizing.

In [None]:
# Check the cardinality of each column
df_account_info.nunique()

customer_id        400
email              400
state               50
plan                 4
plan_list_price    106
churn_status         2
dtype: int64
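
One way to probe the pricing hypothesis is to summarize ``plan_list_price`` within each plan. This is a rough sketch on a toy frame built from the five preview rows shown earlier, not on the real file; `sample` and `price_summary` are names I introduce here for illustration:

```python
import pandas as pd

# Toy sample copied from the account info preview; swap in df_account_info for the real check.
sample = pd.DataFrame({
    "customer_id": ["C10000", "C10001", "C10002", "C10003", "C10004"],
    "plan": ["Enterprise", "Basic", "Basic", "Free", "Enterprise"],
    "plan_list_price": [105, 22, 24, 0, 119],
})

# Spread of list prices inside each plan. Many distinct values or a wide
# min-max range per plan would be consistent with prices varying by customer.
price_summary = (
    sample.groupby("plan")["plan_list_price"]
    .agg(["min", "max", "nunique"])
)

print(price_summary)
```

This only shows that prices vary within a plan. Confirming that they changed over time would need a date column to group by as well.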

## Cleaning

This is the most important section. Dedicate a sub-section to each dataset. Use Markdown cells to explain why you are making changes (e.g., "Dropping row 402 because the User ID is invalid").

__1. Cleaning Dataset 1 (Users)__
- __Standardize Headers:__ ``df.columns = df.columns.str.lower().str.replace(' ', '_')``
- __Type Casting:__ Convert 'date_joined' to datetime.
- __Handling Nulls:__ Fill or drop.
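
For the account info data, the main null-handling decision is ``churn_status``. A minimal sketch, assuming a missing value means "not churned" rather than "unknown". That is a decision to flag for the manager, and the `accounts` frame and `is_churned` column are illustrative names:

```python
import numpy as np
import pandas as pd

# Toy sample mirroring the preview: churned rows hold "Y", the rest are NaN.
accounts = pd.DataFrame({
    "customer_id": ["C10000", "C10001", "C10002", "C10003", "C10004"],
    "churn_status": ["Y", "Y", np.nan, np.nan, np.nan],
})

# Assumption: a missing churn_status means the customer has not churned.
accounts["churn_status"] = accounts["churn_status"].fillna("N")

# A boolean flag is easier to aggregate than a Y/N string.
accounts["is_churned"] = accounts["churn_status"].eq("Y")

# Share of accounts flagged as churned in this toy sample.
churn_rate = accounts["is_churned"].mean()
print(f"Churn rate: {churn_rate:.1%}")
```

On the full data the inspection reported 286 missing values out of 400 rows. If every non-null value denotes churn, that would put the baseline churn rate at 114 / 400 = 28.5%.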

__2. Cleaning Dataset 2 (Orders)__
- __Consistency:__ Check if ``user_id`` in ``Orders`` exists in ``Users``.
- __Outliers:__ Check for negative prices or impossible dates.
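
Applied to this project, the consistency check has an extra wrinkle: the previews show ``customer_id`` values like `C10000` in the account data but numeric ``user_id`` values like `10125` in the other two datasets. The sketch below assumes the two keys differ only by the leading "C", which still has to be verified on the real data. The frames are toy stand-ins:

```python
import pandas as pd

# Toy stand-ins for the account and support tables.
accounts = pd.DataFrame({"customer_id": ["C10000", "C10001", "C10002"]})
support = pd.DataFrame({"user_id": [10000, 10002, 10999]})

# Assumption: customer_id is the numeric user_id with a "C" prefix.
accounts["user_id"] = (
    accounts["customer_id"].str.replace(r"^C", "", regex=True).astype(int)
)

# Tickets whose user_id has no matching account ("orphans").
orphan_mask = ~support["user_id"].isin(accounts["user_id"])
orphans = support.loc[orphan_mask]

print(f"Orphan tickets: {len(orphans)}")
```

The same mask can be run against the user activity table, so orphan rows are documented before anything is dropped.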

__3. Cleaning Dataset 3 (Products)__
- __String Manipulation:__ Strip whitespace from ``productnames``.
- __Categorization:__ Ensure categories match valid lists.

## Integration (Merging)

If your goal is to analyze them together, merge them after they are individually clean.

In [None]:
# Example: Merge Orders with Users
df_merged = df_orders.merge(df_users, on='user_id', how='left')

# Check for data loss after merge (verify row counts)
print(f"Orders shape: {df_orders.shape}")
print(f"Merged shape: {df_merged.shape}")
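
For the churn data, support tickets and activity events have many rows per user, so joining them directly onto the accounts would duplicate customers. A possible approach, sketched on toy frames with a shared numeric `user_id` key (itself an assumption), is to aggregate each table to one row per user first. The feature names `n_tickets`, `avg_resolution_hours` and `n_events` are illustrative:

```python
import pandas as pd

# Toy stand-ins; all three are assumed to share a numeric user_id key.
accounts = pd.DataFrame({"user_id": [10000, 10001, 10002],
                         "plan": ["Enterprise", "Basic", "Basic"]})
support = pd.DataFrame({"user_id": [10000, 10000, 10002],
                        "resolution_time_hours": [11.48, 1.01, 10.09]})
activity = pd.DataFrame({"user_id": [10001, 10001, 10001, 10002],
                         "event_type": ["watch_video", "read_article",
                                        "track_workout", "watch_video"]})

# One row per user: ticket count (non-null resolution times) and mean resolution time.
support_agg = support.groupby("user_id").agg(
    n_tickets=("resolution_time_hours", "count"),
    avg_resolution_hours=("resolution_time_hours", "mean"),
).reset_index()

# One row per user: number of logged events.
activity_agg = activity.groupby("user_id").size().rename("n_events").reset_index()

# Left joins keep every account; validate raises if a key is duplicated.
master = (
    accounts.merge(support_agg, on="user_id", how="left", validate="one_to_one")
    .merge(activity_agg, on="user_id", how="left", validate="one_to_one")
)

# Users with no tickets or events get a count of 0 instead of NaN.
master[["n_tickets", "n_events"]] = master[["n_tickets", "n_events"]].fillna(0).astype(int)

print(master)
```

`avg_resolution_hours` is deliberately left as NaN for users without tickets, since a resolution time of 0 would be misleading.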

## Final Validation

Run a final sanity check on the clean/merged data.

In [None]:
#Check for duplicates one last time
assert df_merged.duplicated().sum() == 0, "Duplicates found!"

# Check for unexpected nulls
assert df_merged['order_total'].isnull().sum() == 0, "Nulls in order total!"

print("Data validation passed.")

## Export Data

Save the files to the `processed` (or clean) folder.

In [None]:
## Export Data
df_merged.to_csv('data/processed/master_dataset_clean.csv', index=False)