# Data Loading and Pre-Processing

This notebook loads and pre-processes data from Challenge 1:
- Score files (2020-2023)
- Budget units
- Transaction funnel data
- Analytics data

## 1. Import Libraries

In [13]:
import pandas as pd
import numpy as np

## 2. Define File Paths

In [14]:
# Score file paths
scores_20 = '../Data/Challenge_1/scores_20.csv'
scores_21 = '../Data/Challenge_1/scores_21.csv'
scores_22 = '../Data/Challenge_1/scores_22.csv'
scores_23 = '../Data/Challenge_1/scores_23.csv'
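An alternative way to define the same paths is to build them from a single base directory, so that a change of folder only has to be made once. This is a minimal sketch: `DATA_DIR` and `score_paths` are hypothetical names that do not appear elsewhere in this notebook.

```python
from pathlib import Path

# Base folder taken from the literal paths used in this notebook.
DATA_DIR = Path("../Data/Challenge_1")

# Map each year to its score file: scores_20.csv ... scores_23.csv.
score_paths = {
    year: DATA_DIR / f"scores_{year % 100}.csv"
    for year in range(2020, 2024)
}

for year, path in score_paths.items():
    print(year, path)
```

`pd.read_csv` accepts `Path` objects directly, so the values of `score_paths` can be passed to it unchanged.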

## 3. Concatenate Score Files

In [15]:
def concat_scores():
    """Concatenate the yearly score files (2020-2023) into one DataFrame."""
    paths = [scores_20, scores_21, scores_22, scores_23]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    data_scores = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)
    return data_scores
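Once the files are concatenated, the rows no longer record which yearly file they came from. If that matters for later analysis, one option is to tag each frame before concatenating. The sketch below assumes the year is not already stored in the files; the `year` column, the helper name `concat_scores_with_year`, and the in-memory sample data are all illustrative.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the yearly CSV files (illustrative values only).
sample_files = {
    2020: io.StringIO(
        "Trip,organization,global_satisfaction,period\n"
        "Trip A,6.2,5.3,3\n"
    ),
    2021: io.StringIO(
        "Trip,organization,global_satisfaction,period\n"
        "Trip A,7.1,6.9,1\n"
        "Trip B,7.8,7.4,2\n"
    ),
}

def concat_scores_with_year(sources):
    """Read each source and tag its rows with the year it came from."""
    frames = [pd.read_csv(src).assign(year=year) for year, src in sources.items()]
    return pd.concat(frames, ignore_index=True)

scores_demo = concat_scores_with_year(sample_files)
print(scores_demo)
```

With the real files, the dictionary values would be the four score paths instead of `StringIO` objects.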

## 4. Load All Datasets

In [17]:
# Load datasets
data_budget = pd.read_excel('../Data/Challenge_1/budget_units.xlsx')
data_funnel = pd.read_csv('../Data/Challenge_1/transactions.csv')
data_analysis = pd.read_csv('../Data/Challenge_1/analytics_data.csv')
data_scores = concat_scores()

print("All datasets loaded successfully!")

All datasets loaded successfully!


## 5. Preview Datasets

In [18]:
print("Budget Data:")
display(data_budget.head())
print(f"\nShape: {data_budget.shape}")

Budget Data:


| | Trip | period_20 | period_21 | period_22 | period_23 | period_24 |
|---|---|---|---|---|---|---|
| 0 | Art and Architecture in Barcelona | A | A | A | A | A |
| 1 | Beach Vacation in the Balearic Islands | B | NaN | NaN | C | NaN |
| 2 | Castle Tour in Bavaria, Germany | B | B | C | B | NaN |
| 3 | Cheese and Chocolate Tour in Switzerland | B | B | B | B | NaN |
| 4 | Countryside Escape in Tuscany | B | B | B | B | NaN |



Shape: (22, 6)


In [19]:
print("Funnel/Transaction Data:")
display(data_funnel.head())
print(f"\nShape: {data_funnel.shape}")

Funnel/Transaction Data:


| | Date | Customer ID | Status | Operator ID |
|---|---|---|---|---|
| 0 | 2019-01-02 | 8022947342 | Filled in form | O3 |
| 1 | 2019-06-24 | 8646438687 | Filled in form | O3 |
| 2 | 2019-04-04 | 9379000553 | Filled in form | O3 |
| 3 | 2019-02-04 | 1464743096 | Filled in form | O4 |
| 4 | 2019-04-11 | 8015744618 | Filled in form | O1 |



Shape: (16120, 4)


In [20]:
print("Analytics Data:")
display(data_analysis.head())
print(f"\nShape: {data_analysis.shape}")

Analytics Data:


| | trip_name | page_views | unique_visitors | avg_session_duration | bounce_rate | conversion_rate |
|---|---|---|---|---|---|---|
| 0 | Mountain Hiking in the Swiss Alps | 142497 | 16260 | 23.3s | 45% | 1.5% |
| 1 | Kayaking in the Norwegian Fjords | 96390 | 12282 | 25.2s | 50% | 1.2% |
| 2 | Cycling Tour in the Pyrenees | 111707 | 14480 | 21.9s | 48% | 1.8% |
| 3 | Historical Tour of Rome | 152355 | 20292 | 30.9s | 40% | 1.9% |
| 4 | Cultural Immersion in Prague | 132790 | 17410 | 26.8s | 42% | 1.6% |



Shape: (18, 6)


In [21]:
print("Scores Data:")
display(data_scores.head())
print(f"\nShape: {data_scores.shape}")

Scores Data:


| | Trip | organization | global_satisfaction | period |
|---|---|---|---|---|
| 0 | Mountain Hiking in the Swiss Alps | 6.2 | 5.3 | 3 |
| 1 | Kayaking in Costa Brava | 7.1 | 6.9 | 1 |
| 2 | Cycling Tour in the Pyrenees | 7.8 | 7.4 | 2 |
| 3 | Historical Tour of Rome | 8.5 | 7.0 | 4 |
| 4 | Cultural Immersion in Prague | 7.7 | 8.6 | 1 |



Shape: (74, 4)


## 6. Data Pre-Processing

Check for duplicates and missing values in all datasets.

In [22]:
def pre_processing():
    """Check for duplicates and missing values in all datasets."""
    data = {
        "data_budget": data_budget,
        "data_funnel": data_funnel,
        "data_analysis": data_analysis,
        "data_scores": data_scores,
    }
    
    for name, df in data.items():
        if df.duplicated().any() or df.isnull().values.any():
            print("\n-----------------------------------\n")
            print(f"Dataset: {name}")
            # print("Data information and description:")
            # df.info()
            # df.describe(include='all')

            print("\nCheck duplicates:")
            print(df.duplicated().sum())

            print("\nCheck missing or null:")
            print(df.isnull().sum())
            print("\nPercentage of missing values:")
            print(df.isnull().mean() * 100)

            print("\n-----------------------------------\n")
        else:
            print(f"\n{name} - No duplicates or missing values found.")

### Run Pre-processing Check

In [23]:
pre_processing()


-----------------------------------

Dataset: data_budget

Check duplicates:
0

Check missing or null:
Trip          0
period_20     4
period_21     5
period_22     3
period_23     2
period_24    17
dtype: int64

Percentage of missing values:
Trip          0.000000
period_20    18.181818
period_21    22.727273
period_22    13.636364
period_23     9.090909
period_24    77.272727
dtype: float64

-----------------------------------


data_funnel - No duplicates or missing values found.

data_analysis - No duplicates or missing values found.

data_scores - No duplicates or missing values found.
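The printed report above is easy to read for four datasets, but the same checks can also be collected into one table that is simpler to compare or save. The following is a rough sketch rather than a replacement for `pre_processing()`: the helper `summarize_quality` and the demo frames are hypothetical.

```python
import numpy as np
import pandas as pd

def summarize_quality(frames):
    """Return one row per dataset with duplicate and missing-value counts."""
    rows = []
    for name, df in frames.items():
        missing = df.isnull()
        rows.append({
            "dataset": name,
            "rows": len(df),
            "duplicate_rows": int(df.duplicated().sum()),
            "missing_cells": int(missing.sum().sum()),
            # Highest per-column share of missing values, in percent
            "worst_column_missing_pct": float(missing.mean().max() * 100),
        })
    return pd.DataFrame(rows).set_index("dataset")

# Small illustrative frames; in the notebook these would be the four datasets.
demo = {
    "clean": pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}),
    "messy": pd.DataFrame({"a": [1, 1, np.nan, 4], "b": ["x", "x", "y", None]}),
}
summary = summarize_quality(demo)
print(summary)
```

Passing the same dictionary that `pre_processing()` builds would give one summary row per dataset.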


## 7. Data Cleaning

The pre-processing check shows that 'period_24' is about 77% missing in data_budget (17 of 22 rows), so we drop it.

In [24]:
# Remove the mostly-null column (17 of 22 values missing)
if 'period_24' in data_budget.columns:
    data_budget = data_budget.drop(columns=['period_24'])
    print("Dropped 'period_24' column from data_budget")
else:
    print("'period_24' column not found in data_budget")

print(f"\nUpdated data_budget shape: {data_budget.shape}")

Dropped 'period_24' column from data_budget

Updated data_budget shape: (22, 5)
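If the same rule should apply to any column, not only 'period_24', the decision can be expressed as a threshold on the fraction of missing values. This is a sketch under assumptions: the 0.5 threshold, the helper name `drop_sparse_columns`, and the demo frame are illustrative and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose fraction of missing values exceeds max_missing."""
    missing_fraction = df.isnull().mean()
    to_drop = missing_fraction[missing_fraction > max_missing].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Illustrative frame shaped like data_budget, with made-up values.
budget_demo = pd.DataFrame({
    "Trip": ["T1", "T2", "T3", "T4"],
    "period_23": ["A", "B", np.nan, "B"],        # 25% missing: kept
    "period_24": ["A", np.nan, np.nan, np.nan],  # 75% missing: dropped
})
cleaned, dropped = drop_sparse_columns(budget_demo)
print("Dropped:", dropped)
print("Remaining columns:", list(cleaned.columns))
```

With a 0.5 threshold, the missing percentages reported in section 6 would lead to only 'period_24' being dropped from data_budget, since the other period columns are all below 23% missing.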


## 8. Next Steps

Check that:
- Data types are correct
- Date formats are consistent
- Trip names match across datasets
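
The three checks above could be sketched as follows. The previews show that the analytics metrics are stored as strings with unit suffixes ("23.3s", "45%"), that dates look like `YYYY-MM-DD`, and that trip names are spread over the `Trip` and `trip_name` columns. The toy frames, the new column names, and the assumption that every date follows `%Y-%m-%d` are illustrative only.

```python
import pandas as pd

# Toy frames mirroring the previewed column layouts (values are illustrative).
analytics_demo = pd.DataFrame({
    "trip_name": ["Historical Tour of Rome", "Cycling Tour in the Pyrenees"],
    "avg_session_duration": ["30.9s", "21.9s"],
    "bounce_rate": ["40%", "48%"],
})
funnel_demo = pd.DataFrame({"Date": ["2019-01-02", "2019-06-24", "not a date"]})
scores_demo = pd.DataFrame({
    "Trip": ["Historical Tour of Rome", "Kayaking in Costa Brava"],
})

# 1. Data types: strip the unit suffix and convert the strings to floats.
analytics_demo["avg_session_duration_s"] = (
    analytics_demo["avg_session_duration"].str.rstrip("s").astype(float)
)
analytics_demo["bounce_rate_pct"] = (
    analytics_demo["bounce_rate"].str.rstrip("%").astype(float)
)

# 2. Date formats: parse with an explicit format; values that do not
#    match become NaT, which makes inconsistent rows easy to list.
parsed_dates = pd.to_datetime(funnel_demo["Date"], format="%Y-%m-%d", errors="coerce")
bad_date_rows = funnel_demo[parsed_dates.isna()]

# 3. Trip names: compare the sets of names found in two datasets.
score_trips = set(scores_demo["Trip"].str.strip())
analytics_trips = set(analytics_demo["trip_name"].str.strip())
only_in_scores = sorted(score_trips - analytics_trips)
only_in_analytics = sorted(analytics_trips - score_trips)

print(analytics_demo.dtypes)
print(bad_date_rows)
print("Only in scores:", only_in_scores)
print("Only in analytics:", only_in_analytics)
```

A name that appears in only one dataset is not necessarily an error; it may be a trip that simply has no data in the other source, so the two lists are a starting point for manual review.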