# 2 - PREPARE

In this section I will import the monthly datasets and concatenate them to create a single DataFrame called 'df'.

## 2.1 Contents

2 PREPARE
    2.1 Contents
    2.2 Introduction
    2.3 Imports
    2.4 Objectives
    2.5 Load the Data
    2.6 Save Data
    


## 2.2 Introduction

In this notebook I will collect the data, organize it, and make sure it is well defined. I will then save the data so that it is ready for the next step, data cleaning.

## 2.3 Imports

In [12]:
#import packages
import pandas as pd
import os

## 2.4 Objectives

Fundamental questions to resolve in this notebook before moving on to cleaning:
    
    1. Do I have enough data to tackle the objectives?
            Have I identified the required target values?
            Do I have potentially useful features?
    2. Do I have any fundamental issues with the data?
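
A quick way to start answering these questions is to summarize a DataFrame before doing anything else. The following is a minimal sketch, not part of the original notebook: the helper name `quick_data_check`, the list of required columns, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def quick_data_check(frame: pd.DataFrame, required_columns: list) -> dict:
    """Return a small summary that speaks to the questions above (hypothetical helper)."""
    missing_columns = [c for c in required_columns if c not in frame.columns]
    return {
        "n_rows": len(frame),
        "missing_columns": missing_columns,
        # Fraction of missing values per column, largest first
        "null_fraction": frame.isna().mean().sort_values(ascending=False),
        # Fully identical rows usually point to an import problem
        "n_duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy stand-in for one month of trip data (values are illustrative only)
sample = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3"],
    "started_at": ["2024-01-01 08:00:00", "2024-01-01 09:00:00", "2024-01-01 10:00:00"],
    "end_lat": [41.9, None, 41.8],
    "member_casual": ["member", "casual", "member"],
})

summary = quick_data_check(
    sample, ["ride_id", "started_at", "ended_at", "member_casual"]
)
print(summary["missing_columns"])
print(summary["null_fraction"])
```

On the real data, the same call could be run on 'df' once it has been built, with the required columns chosen to match the analysis goals.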

## 2.5 Load the Data

In [40]:
# importing 1 csv at a time
desktop_path1 = "/Users/amylee/Desktop/raw_data/202401-divvy-tripdata.csv"
df1 = pd.read_csv(desktop_path1)

In [41]:
# show first 5 rows
df1.head()

| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C1D650626C8C899A | electric_bike | 2024-01-12 15:30:27 | 2024-01-12 15:37:59 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.903267 | -87.634737 | 41.889177 | -87.638506 | member |
| 1 | EECD38BDB25BFCB0 | electric_bike | 2024-01-08 15:45:46 | 2024-01-08 15:52:59 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.902937 | -87.63444 | 41.889177 | -87.638506 | member |
| 2 | F4A9CE78061F17F7 | electric_bike | 2024-01-27 12:27:19 | 2024-01-27 12:35:19 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.902951 | -87.63447 | 41.889177 | -87.638506 | member |
| 3 | 0A0D9E15EE50B171 | classic_bike | 2024-01-29 16:26:17 | 2024-01-29 16:56:06 | Wells St & Randolph St | TA1305000030 | Larrabee St & Webster Ave | 13193 | 41.884295 | -87.633963 | 41.921822 | -87.64414 | member |
| 4 | 33FFC9805E3EFF9A | classic_bike | 2024-01-31 05:43:23 | 2024-01-31 06:09:35 | Lincoln Ave & Waveland Ave | 13253 | Kingsbury St & Kinzie St | KA1503000043 | 41.948797 | -87.675278 | 41.889177 | -87.638506 | member |


### 2.5.1 Read monthly CSV files into DataFrames and create dictionary of DataFrames

In [52]:
# Define the base path to your data directory
base_path = "/Users/amylee/Desktop/raw_data/" 

# Create an empty dictionary to store the DataFrames
dataframes = {}

# Iterate through months 1 to 11
for month in range(1, 12):
    # Construct the filename for each month
    filename = f"2024{month:02d}-divvy-tripdata.csv"  # Pad month with leading zero (e.g., '01' for Jan)
    filepath = os.path.join(base_path, filename)

    # Read csv file into a dataframe
    try:
        df = pd.read_csv(filepath)
        dataframes[f"df{month}"] = df  # Store the DataFrame in the dictionary
    except FileNotFoundError:
        print(f"Warning: File not found: {filepath}")
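
As an alternative to building each filename by hand, the files could be discovered with a glob pattern. This is only a sketch under assumptions: the helper name `load_monthly_csvs` and the pattern `2024*-divvy-tripdata.csv` are mine, and the demonstration writes two tiny made-up files to a temporary directory rather than touching the real raw data. Note that a glob would also pick up a December file if one were present, whereas the loop above stops at month 11.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_monthly_csvs(directory, pattern="2024*-divvy-tripdata.csv"):
    """Read every CSV in `directory` matching `pattern`, keyed by file stem."""
    frames = {}
    # sorted() keeps the months in calendar order because of the zero-padded names
    for csv_path in sorted(Path(directory).glob(pattern)):
        frames[csv_path.stem] = pd.read_csv(csv_path)
    return frames


# Demonstration with two tiny made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for month in (1, 2):
        pd.DataFrame({"ride_id": [f"R{month}"]}).to_csv(
            Path(tmp) / f"2024{month:02d}-divvy-tripdata.csv", index=False
        )
    monthly = load_monthly_csvs(tmp)

print(list(monthly))
```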

In [53]:
# Now I have DataFrames named df1, df2, ..., df11 in the 'dataframes' dictionary
print(dataframes["df1"].head())  # Example: Access and display the first DataFrame

            ride_id  rideable_type           started_at             ended_at  \
0  C1D650626C8C899A  electric_bike  2024-01-12 15:30:27  2024-01-12 15:37:59   
1  EECD38BDB25BFCB0  electric_bike  2024-01-08 15:45:46  2024-01-08 15:52:59   
2  F4A9CE78061F17F7  electric_bike  2024-01-27 12:27:19  2024-01-27 12:35:19   
3  0A0D9E15EE50B171   classic_bike  2024-01-29 16:26:17  2024-01-29 16:56:06   
4  33FFC9805E3EFF9A   classic_bike  2024-01-31 05:43:23  2024-01-31 06:09:35   

           start_station_name start_station_id           end_station_name  \
0           Wells St & Elm St     KA1504000135   Kingsbury St & Kinzie St   
1           Wells St & Elm St     KA1504000135   Kingsbury St & Kinzie St   
2           Wells St & Elm St     KA1504000135   Kingsbury St & Kinzie St   
3      Wells St & Randolph St     TA1305000030  Larrabee St & Webster Ave   
4  Lincoln Ave & Waveland Ave            13253   Kingsbury St & Kinzie St   

  end_station_id  start_lat  start_lng    end_lat    end

In [54]:
# Extract the DataFrames from the dictionary
df1 = dataframes['df1']
df2 = dataframes['df2']
df3 = dataframes['df3']
df4 = dataframes['df4']
df5 = dataframes['df5']
df6 = dataframes['df6']
df7 = dataframes['df7']
df8 = dataframes['df8']
df9 = dataframes['df9']
df10 = dataframes['df10']
df11 = dataframes['df11']

In [55]:
# Create a list of all dataframes
dataframes_list = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11]

### 2.5.2 Concatenate monthly data into a single DataFrame 'df'

In [56]:
# Concatenate all DataFrames into a single DataFrame
df = pd.concat(dataframes_list, ignore_index=True)
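
The same concatenation can be done straight from the dictionary, without unpacking each DataFrame into its own variable first. The sketch below uses two toy DataFrames as stand-ins for the monthly data (the values are made up) and adds a row-count check to confirm that nothing was lost or duplicated along the way.

```python
import pandas as pd

# Toy stand-ins for the monthly DataFrames held in the dictionary
dataframes = {
    "df1": pd.DataFrame({"ride_id": ["A", "B"]}),
    "df2": pd.DataFrame({"ride_id": ["C"]}),
}

# Concatenate the dictionary values in insertion order and rebuild the index
combined = pd.concat(list(dataframes.values()), ignore_index=True)

# Sanity check: the combined row count equals the sum of the monthly row counts
expected_rows = sum(len(frame) for frame in dataframes.values())
assert len(combined) == expected_rows

print(combined)
```

With `ignore_index=True` the combined DataFrame gets a fresh 0..n-1 index, so row labels from different months do not collide.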

In [57]:
# Print the first few rows of the combined DataFrame
df.head()

| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C1D650626C8C899A | electric_bike | 2024-01-12 15:30:27 | 2024-01-12 15:37:59 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.903267 | -87.634737 | 41.889177 | -87.638506 | member |
| 1 | EECD38BDB25BFCB0 | electric_bike | 2024-01-08 15:45:46 | 2024-01-08 15:52:59 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.902937 | -87.63444 | 41.889177 | -87.638506 | member |
| 2 | F4A9CE78061F17F7 | electric_bike | 2024-01-27 12:27:19 | 2024-01-27 12:35:19 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.902951 | -87.63447 | 41.889177 | -87.638506 | member |
| 3 | 0A0D9E15EE50B171 | classic_bike | 2024-01-29 16:26:17 | 2024-01-29 16:56:06 | Wells St & Randolph St | TA1305000030 | Larrabee St & Webster Ave | 13193 | 41.884295 | -87.633963 | 41.921822 | -87.64414 | member |
| 4 | 33FFC9805E3EFF9A | classic_bike | 2024-01-31 05:43:23 | 2024-01-31 06:09:35 | Lincoln Ave & Waveland Ave | 13253 | Kingsbury St & Kinzie St | KA1503000043 | 41.948797 | -87.675278 | 41.889177 | -87.638506 | member |


In [58]:
# High level view of dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682196 entries, 0 to 5682195
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 563.6+ MB
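
The `+` in `563.6+ MB` means pandas is reporting a lower bound, because the contents of object (string) columns are not measured by default. Calling `df.info(memory_usage="deep")` gives the full figure. The sketch below shows the difference on a toy DataFrame (made-up values) and converts two low-cardinality text columns to the `category` dtype, which often reduces memory on large frames. Whether that conversion is worthwhile here is an assumption I have not measured on the real data.

```python
import pandas as pd

# Toy stand-in with the same kinds of columns as the trip data
sample = pd.DataFrame({
    "rideable_type": ["electric_bike", "classic_bike", "electric_bike", "classic_bike"],
    "member_casual": ["member", "casual", "member", "member"],
    "start_lat": [41.90, 41.88, 41.95, 41.90],
})

# deep=False skips the string contents; deep=True counts them as well
shallow_bytes = sample.memory_usage(deep=False).sum()
deep_bytes = sample.memory_usage(deep=True).sum()
print(shallow_bytes, deep_bytes)

# Columns with only a few distinct values are candidates for the category dtype
compact = sample.astype({"rideable_type": "category", "member_casual": "category"})
print(compact.dtypes)
```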


## 2.6 Save Data

In [59]:
df.shape

(5682196, 13)

The raw data was uploaded as monthly CSVs. I will save the derived data in a separate location from the raw files, which guards against overwriting the original data.

In [60]:
def save_file(df, filename, path):
    """
    Saves a pandas DataFrame to a CSV file.

    Args:
        df: The pandas DataFrame to save.
        filename: The name of the CSV file.
        path: The path to the directory where the file should be saved.
    """
    filepath = f"{path}/{filename}" 
    df.to_csv(filepath, index=False)

In [65]:
%pwd

'/Users/amylee/google_data_analytics/cyclistic_case_study/cyclistic_notebooks'

In [67]:
# save the data to a new csv file
datapath = '/Users/amylee/google_data_analytics/cyclistic_case_study/cyclistic_notebooks'
save_file(df, 'df_cyclistic_prepared.csv', datapath)
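
To make the "separate location" rule harder to break by accident, the save step could create the target directory if it is missing and refuse to overwrite an existing file unless asked. This is a minimal sketch, not the function used above: `save_derived`, its `overwrite` flag, and the `derived` directory name are hypothetical, and the demonstration writes to a temporary directory.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_derived(frame, filename, derived_dir, overwrite=False):
    """Write `frame` as CSV under `derived_dir`, creating the directory if needed."""
    target_dir = Path(derived_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # Refuse to clobber an existing file unless the caller opts in
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} already exists; pass overwrite=True to replace it")
    frame.to_csv(target, index=False)
    return target


# Demonstration in a temporary directory with a toy DataFrame
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"ride_id": ["A", "B"]})
    written = save_derived(toy, "df_cyclistic_prepared.csv", Path(tmp) / "derived")
    round_trip = pd.read_csv(written)

print(round_trip)
```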

### PREPARE SUMMARY

In this section I uploaded the monthly CSV data to my working directory, read each file into its own DataFrame, and then combined them into a single DataFrame called 'df'.

The DataFrame has 13 columns and 5,682,196 rows.

#### Save the data

In [None]:
# save the data to a new csv file
datapath = '../data'
save_file(df, 'df_cyclistic_clean.csv', datapath)

### PROCESS SUMMARY

Data cleaning tasks completed in this section:
1. Ensured all columns were in an appropriate data format: converted 'started_at' and 'ended_at' to datetime.
2. Addressed or eliminated null values:
    a. Dropped rows with null values in the 'end_lat' and 'end_lng' columns (~0.01% of the dataset).
    b. Dropped 4 columns: 'start_station_name', 'start_station_id', 'end_station_name', 'end_station_id'. 18% of observations in these columns were missing data. The dataset also contains start/end latitude and longitude, which will be used for geographical analysis instead.
3. Checked for duplicates and found none. Identified an import error and corrected it.
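
The cleaning steps listed above can be sketched roughly as follows. This is an illustration on a three-row toy DataFrame, not the code from the cleaning notebook. The row values are partly made up, and checking duplicates on 'ride_id' is my assumption about how duplicates were defined.

```python
import pandas as pd

# Toy stand-in for the combined data, with some missing values
raw = pd.DataFrame({
    "ride_id": ["A", "B", "C"],
    "started_at": ["2024-01-12 15:30:27", "2024-01-08 15:45:46", "2024-01-27 12:27:19"],
    "ended_at": ["2024-01-12 15:37:59", "2024-01-08 15:52:59", "2024-01-27 12:35:19"],
    "start_station_name": ["Wells St & Elm St", None, "Wells St & Elm St"],
    "start_station_id": ["KA1504000135", None, "KA1504000135"],
    "end_station_name": ["Kingsbury St & Kinzie St", None, None],
    "end_station_id": ["KA1503000043", None, None],
    "end_lat": [41.889177, 41.889177, None],
    "end_lng": [-87.638506, -87.638506, None],
})

station_columns = [
    "start_station_name", "start_station_id", "end_station_name", "end_station_id",
]

cleaned = (
    raw.assign(
        # Step 1: convert the timestamp strings to datetime
        started_at=pd.to_datetime(raw["started_at"]),
        ended_at=pd.to_datetime(raw["ended_at"]),
    )
    # Step 2a: drop rows with no end coordinates
    .dropna(subset=["end_lat", "end_lng"])
    # Step 2b: drop the sparsely populated station columns
    .drop(columns=station_columns)
    .reset_index(drop=True)
)

# Step 3: count repeated ride IDs (assumed definition of a duplicate)
n_duplicates = int(cleaned.duplicated(subset="ride_id").sum())
print(cleaned.dtypes)
print(n_duplicates)
```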

I then added columns to calculate ride_length and the day_of_week for each ride.
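
A minimal sketch of how those two columns might be derived is shown below. It assumes 'ride_length' is the difference between 'ended_at' and 'started_at' and that 'day_of_week' is taken from the start time. The extra 'ride_length_min' column is my own addition for readability, and the two rows reuse timestamps from the sample output earlier in this notebook.

```python
import pandas as pd

rides = pd.DataFrame({
    "started_at": pd.to_datetime(["2024-01-12 15:30:27", "2024-01-29 16:26:17"]),
    "ended_at": pd.to_datetime(["2024-01-12 15:37:59", "2024-01-29 16:56:06"]),
})

# Ride duration as a timedelta, plus a plain number of minutes (hypothetical column)
rides["ride_length"] = rides["ended_at"] - rides["started_at"]
rides["ride_length_min"] = rides["ride_length"].dt.total_seconds() / 60

# Day name based on when the ride started
rides["day_of_week"] = rides["started_at"].dt.day_name()

print(rides)
```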