# Churn at Fit.ly Tech

### Account Info

`df_account_info` contains information on every customer account: state, plan, plan price and churn status.

__Columns:__
- ``signup_date``: The day the customer created their account
- ``customer_id``: Unique ID per customer
- ``email``: Customer's email
- ``state``: The state where the customer opened their account. The company only operates in the US for now
- ``plan``: The plan tier that our customers subscribed to (free, basic, pro or enterprise)
- ``plan_list_price``: The price the customer paid for their subscription
- ``churn_status``: The company marks people as churned once their subscription has been cancelled

### Customer Support

`df_customer_support` contains information about client interactions with the customer support service.

__Columns:__
- ``ticket_time``: Time when the interaction started (Pacific time)
- ``user_id``: A unique ID per user
- ``channel``: The channel the ticket was first received in
- ``topic``: The topic that the ticket addresses
- ``resolution_time_hours``: Hours from ticket creation to ticket resolution
- ``state``: Whether the problem was solved or not (meaning still to be confirmed)
- ``comments``: Comments from the client about the interaction

### User Activity 

`df_user_activity` logs every user action in the app.

__Columns:__
- ``event_time``: Time when the client interacted with the app (Pacific time)
- ``user_id``: A unique ID per user
- ``event_type``: The specific action of the client in the app

## Setup

In [45]:
#Import libraries
from utils import config
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline 
import seaborn as sns

# Display options to see all columns
pd.set_option('display.max_columns', None)

# Set a random seed for reproducible results
np.random.seed(42)

## Data Loading

In [46]:
# Load the three raw datasets
try:
    df_account_info = pd.read_csv('../data/raw/da_fitly_account_info.csv')
    df_customer_support = pd.read_csv('../data/raw/da_fitly_customer_support.csv')
    df_user_activity = pd.read_csv('../data/raw/da_fitly_user_activity.csv')
    print("All datasets loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading files: {e}")

All datasets loaded successfully.


## Data Validation

__1. Account Info (`df_account_info`)__

The original data is 400 rows and 6 columns. After validation, all 400 rows remained. The following describes what I did to each column:


- ``signup_date``: The product manager's email said this column existed, but it is not in the data.
- ``customer_id``: Unique ID per customer
- ``email``: Customer's email
- ``state``: The state where the customer opened their account. The company only operates in the US for now
- ``plan``: The plan tier that our customers subscribed to (free, basic, pro or enterprise)
- ``plan_list_price``: The price the customer paid for their subscription
- ``churn_status``: The company marks people as churned once their subscription has been cancelled

In [47]:
df_account_info.info()

<class 'pandas.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   customer_id      400 non-null    str  
 1   email            400 non-null    str  
 2   state            400 non-null    str  
 3   plan             400 non-null    str  
 4   plan_list_price  400 non-null    int64
 5   churn_status     114 non-null    str  
dtypes: int64(1), str(5)
memory usage: 18.9 KB


In [48]:
df_account_info.head()

Unnamed: 0,customer_id,email,state,plan,plan_list_price,churn_status
0,C10000,user10000@example.com,New Jersey,Enterprise,105,Y
1,C10001,user10001@example.net,Louisiana,Basic,22,Y
2,C10002,user10002@example.net,Oklahoma,Basic,24,
3,C10003,user10003@example.com,Michigan,Free,0,
4,C10004,user10004@example.com,Texas,Enterprise,119,


Checking cardinality: `customer_id` and `email` are unique, so no customer appears in more than one row.

In [49]:
# Cardinality
df_account_info.nunique()

customer_id        400
email              400
state               50
plan                 4
plan_list_price    106
churn_status         1
dtype: int64

Checking for duplicates, there are none.

In [50]:
# Duplicates
df_account_info.duplicated().any()

np.False_

__Cleaning Action 1__: `churn_status` only has a tag when the client churned ("Y"). If the client hasn't churned it shows a missing value. So let's change it to "N".

In [51]:
print(df_account_info['churn_status'].value_counts(dropna=False))
df_account_info['churn_status'] = df_account_info['churn_status'].replace(np.nan,'N')
print(df_account_info['churn_status'].value_counts(dropna=False))

churn_status
NaN    286
Y      114
Name: count, dtype: int64
churn_status
N    286
Y    114
Name: count, dtype: int64
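
As a minimal alternative sketch, assuming the same `churn_status` convention ("Y" for churned, missing otherwise), `fillna` states the intent a little more directly than `replace(np.nan, ...)`. The boolean helper column `churned` is a hypothetical name, not part of the dataset, and the small frame is made-up sample data.

```python
import numpy as np
import pandas as pd

# Made-up sample mirroring the raw column: "Y" when churned, NaN otherwise
sample = pd.DataFrame({"churn_status": ["Y", np.nan, np.nan, "Y"]})

# Fill the missing values with "N" (not churned)
sample["churn_status"] = sample["churn_status"].fillna("N")

# Hypothetical helper column: True when the customer churned
sample["churned"] = sample["churn_status"].eq("Y")

print(sample["churn_status"].value_counts())
print(f"Churn rate: {sample['churned'].mean():.2f}")
```

A boolean column like this makes the churn rate a simple `.mean()`, which can be convenient when grouping by plan or state later.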


__Cleaning Action 2__: Let's drop the "C" on every value of the `customer_id` column, so it can be used for joining with the other tables. Also I'll change the datatype to int for better performance.

In [52]:
print(df_account_info['customer_id'].value_counts())
df_account_info['customer_id'] = df_account_info['customer_id'].str.replace('C','').astype(int)
print(df_account_info['customer_id'].value_counts())

customer_id
C10000    1
C10001    1
C10002    1
C10003    1
C10004    1
         ..
C10395    1
C10396    1
C10397    1
C10398    1
C10399    1
Name: count, Length: 400, dtype: int64
customer_id
10000    1
10001    1
10002    1
10003    1
10004    1
        ..
10395    1
10396    1
10397    1
10398    1
10399    1
Name: count, Length: 400, dtype: int64
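
A slightly stricter version of this step could check the ID format before casting, so that an unexpected value fails loudly instead of being silently altered. This is only a sketch on made-up sample IDs; the pattern `C` followed by digits is an assumption based on the values printed above.

```python
import pandas as pd

# Made-up sample IDs in the raw format shown above
ids = pd.Series(["C10000", "C10001", "C10002"], name="customer_id")

# Confirm every ID is a "C" followed only by digits before casting
pattern_ok = ids.str.fullmatch(r"C\d+")
assert pattern_ok.all(), f"Unexpected IDs: {ids[~pattern_ok].tolist()}"

# Remove only the leading "C" (anchored regex), then cast to int
ids_int = ids.str.replace(r"^C", "", regex=True).astype(int)
print(ids_int.tolist())
```

Anchoring the pattern with `^` removes only a leading "C", whereas a plain `replace('C', '')` would remove every "C" in the string.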


In [None]:
# Check that the numeric part of each email matches customer_id
df_account_info[df_account_info['email'].str[4:9].astype(int) == df_account_info['customer_id']].shape

(400, 6)

In [76]:
print(df_account_info['state'].value_counts())
print(df_account_info['state'].nunique())
print(df_account_info['state'].unique())

state
Virginia          16
Vermont           14
Florida           14
Delaware          13
Arizona           12
Colorado          12
Montana           12
Idaho             11
Alaska            10
Rhode Island      10
Missouri          10
North Dakota      10
Michigan           9
Texas              9
Ohio               9
West Virginia      9
Pennsylvania       9
Tennessee          9
Massachusetts      9
Mississippi        9
Connecticut        8
Wisconsin          8
Hawaii             8
Utah               8
Washington         8
New York           7
New Mexico         7
Iowa               7
Illinois           7
Minnesota          7
New Jersey         6
Louisiana          6
Oklahoma           6
North Carolina     6
Kansas             6
Nebraska           6
Nevada             6
Arkansas           6
Wyoming            6
Maryland           6
South Carolina     6
Indiana            6
New Hampshire      6
Kentucky           5
South Dakota       5
Oregon             5
Maine              4
Califor

In [53]:
# Email domains; may be worth a closer look later
df_account_info['email'].str[9:].value_counts(dropna=False)

email
@example.com    141
@example.org    133
@example.net    126
Name: count, dtype: int64
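
The positional slices `str[4:9]` and `str[9:]` used in the two email checks assume that every address starts with `user` followed by exactly five digits. A regex-based sketch that does not depend on fixed positions might look like the following. The sample rows are copied from the preview above, and the group names `email_id` and `domain` are hypothetical.

```python
import pandas as pd

# Sample rows copied from the preview of df_account_info
sample = pd.DataFrame({
    "customer_id": [10000, 10001, 10002],
    "email": [
        "user10000@example.com",
        "user10001@example.net",
        "user10002@example.net",
    ],
})

# Capture the digits after "user" and the domain after "@" in one pass
parts = sample["email"].str.extract(r"^user(?P<email_id>\d+)@(?P<domain>.+)$")

# True where the ID embedded in the email equals customer_id
# (astype(int) would raise if an email did not match the pattern)
id_matches = parts["email_id"].astype(int).eq(sample["customer_id"])

print(f"Emails matching customer_id: {id_matches.sum()} of {len(sample)}")
print(parts["domain"].value_counts())
```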

In [54]:
def inspect_data(df, name):
    print(f"--- Inspection: {name} ---")
    print(f"Shape: {df.shape}")
    print(f"Missing Values: \n{df.isnull().sum()[df.isnull().sum() > 0]}")
    print(f"Duplicates: {df.duplicated().sum()}")
    #print(f"Info: \n{df.dtypes}")
    display(df.head(3))
    #display(df.describe())
    print("\n")
    
    
inspect_data(df_account_info, "Account Info")
inspect_data(df_customer_support, "Customer Support")
inspect_data(df_user_activity, "Products Data")

--- Inspection: Account Info ---
Shape: (400, 6)
Missing Values: 
Series([], dtype: int64)
Duplicates: 0


Unnamed: 0,customer_id,email,state,plan,plan_list_price,churn_status
0,10000,user10000@example.com,New Jersey,Enterprise,105,Y
1,10001,user10001@example.net,Louisiana,Basic,22,Y
2,10002,user10002@example.net,Oklahoma,Basic,24,N




--- Inspection: Customer Support ---
Shape: (918, 7)
Missing Values: 
comments    872
dtype: int64
Duplicates: 0


Unnamed: 0,ticket_time,user_id,channel,topic,resolution_time_hours,state,comments
0,2025-06-13 05:55:17.154573,10125,chat,technical,11.48,1,
1,2025-08-06 13:21:54.539551,10109,chat,account,1.01,0,
2,2025-08-22 12:39:35.718663,10149,chat,technical,10.09,0,Erase my data from your systems.




--- Inspection: Products Data ---
Shape: (445, 3)
Missing Values: 
Series([], dtype: int64)
Duplicates: 0


Unnamed: 0,event_time,user_id,event_type
0,2025-09-08 15:05:39.422721,10118,watch_video
1,2025-09-08 08:15:05.264103,10220,watch_video
2,2025-11-14 06:28:35.207671,10009,share_workout






Let's check column cardinality <br>
- ``customer_id`` and ``email`` are unique, so every row represents a unique client.
- ``state`` has 50 distinct values, matching the number of US states, so it looks correct.
- ``plan`` has 4 distinct values, ['Enterprise', 'Basic', 'Free', 'Pro'], which looks correct.
- ``plan_list_price`` has 106 distinct values, which is unusual since we only have 4 plans. My hypothesis is that plan prices have changed over time.
- ``churn_status`` only shows the value "Y" when a client has already churned, so the NaN values need to be changed to "N" to represent clients who have not churned.
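
The price hypothesis cannot be confirmed without a signup date, but a first step could be to look at how prices spread within each plan. The sketch below uses a made-up sample built from the first rows of the table. On the real data the same `groupby` would be applied to `df_account_info`.

```python
import pandas as pd

# Made-up sample based on the first rows of the account table
sample = pd.DataFrame({
    "plan": ["Enterprise", "Basic", "Basic", "Free", "Enterprise"],
    "plan_list_price": [105, 22, 24, 0, 119],
})

# Price range and number of distinct prices within each plan
price_summary = (
    sample.groupby("plan")["plan_list_price"].agg(["min", "max", "nunique"])
)
print(price_summary)
```

If each plan shows a narrow, non-overlapping price range, that would be consistent with per-plan price changes or discounts rather than data entry errors.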

In [55]:
# Check cardinality
df_account_info.nunique()

customer_id        400
email              400
state               50
plan                 4
plan_list_price    106
churn_status         2
dtype: int64

In [56]:
df_account_info['customer_id'].value_counts()

customer_id
10000    1
10001    1
10002    1
10003    1
10004    1
        ..
10395    1
10396    1
10397    1
10398    1
10399    1
Name: count, Length: 400, dtype: int64

In [57]:
df_account_info['churn_status'] = df_account_info['churn_status'].replace(np.nan,'N')
df_account_info['churn_status'].value_counts()

churn_status
N    286
Y    114
Name: count, dtype: int64

## Cleaning

This is the most important section. Dedicate a sub-section to each dataset. Use Markdown cells to explain why you are making changes (e.g., "Dropping row 402 because the User ID is invalid").

__1. Cleaning Dataset 1 (Users)__
- __Standardize Headers:__ ``df.columns = df.columns.str.lower().str.replace(' ', '_')``
- __Type Casting:__ Convert 'date_joined' to datetime.
- __Handling Nulls:__ Fill or drop.

__2. Cleaning Dataset 2 (Orders)__
- __Consistency:__ Check if ``user_id`` in ``Orders`` exists in ``Users.``
- __Outliers:__ Check for negative prices or impossible dates.

__3. Cleaning Dataset 3 (Products)__
- __String Manipulation:__ Strip whitespace from ``productnames``.
- __Categorization:__ Ensure categories match valid lists.

## Data Validation



__2. Customer Support (`df_customer_support`)__

The original data is 918 rows and 7 columns. Only ``comments`` has missing values (872 rows), and there are no duplicate rows. The columns are:

- ``ticket_time``: Time when the interaction started (Pacific time)
- ``user_id``: A unique ID per user
- ``channel``: The channel the ticket was first received in
- ``topic``: The topic that the ticket addresses
- ``resolution_time_hours``: Hours from ticket creation to ticket resolution
- ``state``: Whether the problem was solved or not (meaning still to be confirmed)
- ``comments``: Comments from the client about the interaction
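
A possible set of checks for this table is sketched below on a made-up sample shaped like the first rows of `df_customer_support`. Two points are assumptions rather than confirmed facts: that `state` is a 0/1 flag, and that "Pacific time" corresponds to the `America/Los_Angeles` time zone.

```python
import pandas as pd

# Made-up sample shaped like the first rows of df_customer_support
support = pd.DataFrame({
    "ticket_time": ["2025-06-13 05:55:17.154573", "2025-08-06 13:21:54.539551"],
    "user_id": [10125, 10109],
    "resolution_time_hours": [11.48, 1.01],
    "state": [1, 0],
})

# Parse the timestamps and attach the Pacific time zone (assumed zone name)
support["ticket_time"] = (
    pd.to_datetime(support["ticket_time"]).dt.tz_localize("America/Los_Angeles")
)

# Resolution times should never be negative
assert (support["resolution_time_hours"] >= 0).all(), "Negative resolution time"

# Assumption: state is a 0/1 flag
assert support["state"].isin([0, 1]).all(), "Unexpected value in state"

print(support.dtypes)
```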

__3. User Activity (`df_user_activity`)__

The original data is 445 rows and 3 columns, with no missing values and no duplicate rows. The columns are:

- ``event_time``: Time when the client interacted with the app (Pacific time)
- ``user_id``: A unique ID per user
- ``event_type``: The specific action of the client in the app
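
Before joining, it may help to confirm that every `user_id` in the activity log exists as a `customer_id` in the account table. The sketch below assumes the two keys refer to the same entity, and it uses made-up sample frames. The ID `99999` is invented to show what an orphan row looks like.

```python
import pandas as pd

# Made-up account keys and activity rows; 99999 is an invented orphan ID
accounts = pd.DataFrame({"customer_id": [10009, 10118, 10220]})
activity = pd.DataFrame({
    "user_id": [10118, 10220, 10009, 99999],
    "event_type": ["watch_video", "watch_video", "share_workout", "watch_video"],
})

# Rows whose user_id has no matching customer_id
orphans = activity[~activity["user_id"].isin(accounts["customer_id"])]

print(f"Orphan events: {len(orphans)}")
print(activity["event_type"].value_counts())
```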

## Integration (Merging)

If your goal is to analyze them together, merge them after they are individually clean.
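
One way the merge could look is sketched below. Support tickets and activity events are one-to-many per user, so the sketch first collapses each table to one row per user and then left-joins onto the accounts. The frames are made-up samples, the aggregate column names (`n_tickets`, `avg_resolution_hours`, `n_events`) are hypothetical, and the choice of aggregates is an assumption rather than a decision made in this notebook.

```python
import pandas as pd

# Made-up samples standing in for the three cleaned tables
accounts = pd.DataFrame({
    "customer_id": [10000, 10001, 10002],
    "churn_status": ["Y", "Y", "N"],
})
support = pd.DataFrame({
    "user_id": [10000, 10000, 10002],
    "resolution_time_hours": [2.0, 4.0, 1.0],
})
activity = pd.DataFrame({
    "user_id": [10001, 10001, 10001],
    "event_type": ["watch_video"] * 3,
})

# Collapse each one-to-many table to one row per user
support_agg = (
    support.groupby("user_id")
    .agg(
        n_tickets=("resolution_time_hours", "size"),
        avg_resolution_hours=("resolution_time_hours", "mean"),
    )
    .reset_index()
)
activity_agg = activity.groupby("user_id").size().rename("n_events").reset_index()

# Left joins keep every account; validate guards against duplicated keys
df_merged = (
    accounts
    .merge(support_agg, left_on="customer_id", right_on="user_id",
           how="left", validate="one_to_one")
    .drop(columns="user_id")
    .merge(activity_agg, left_on="customer_id", right_on="user_id",
           how="left", validate="one_to_one")
    .drop(columns="user_id")
)

# Customers with no tickets or events get a count of 0
count_cols = ["n_tickets", "n_events"]
df_merged[count_cols] = df_merged[count_cols].fillna(0).astype(int)

print(df_merged)
```

Aggregating before merging keeps one row per customer, which avoids the row multiplication that joining the raw ticket and event tables directly would cause.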

## Final Validation

Run a final sanity check on the clean/merged data.

In [58]:
#Check for duplicates one last time
assert df_merged.duplicated().sum() == 0, "Duplicates found!"

# Check for unexpected nulls in the target column
assert df_merged['churn_status'].isnull().sum() == 0, "Nulls in churn_status!"

print("Data validation passed.")

NameError: name 'df_merged' is not defined

## Export Data

Save the files to `processed` or clean folder.

In [None]:
## Export Data
df_merged.to_csv('../data/processed/master_dataset_clean.csv', index=False)
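
`to_csv` does not create missing folders, so the export fails if the output folder does not exist yet. A minimal sketch of a safer export follows. It writes a stand-in frame to a temporary directory in place of `../data/processed`, and the names `df_out` and `out_dir` are hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Stand-in for the merged frame
df_out = pd.DataFrame({"customer_id": [10000, 10001], "churn_status": ["Y", "N"]})

# A temporary directory stands in for ../data/processed in this sketch
out_dir = Path(tempfile.mkdtemp()) / "processed"
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing

out_path = out_dir / "master_dataset_clean.csv"
df_out.to_csv(out_path, index=False)

# Read the file back to confirm the row count survived the round trip
check = pd.read_csv(out_path)
print(f"Wrote {len(check)} rows to {out_path.name}")
```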