CRISP-DM (Cross-Industry Standard Process for Data Mining) is a widely used process model for data mining projects. In Python, each of its phases can be implemented with standard libraries and tools.

Here are the general steps of the CRISP-DM process and some examples of Python libraries and tools that can be used for each step:

**Business Understanding:**

In this step, you define the problem you want to solve and the goals of your data project. You also identify the data sources you need to use.

Example tools: Jupyter Notebook, pandas, NumPy, Matplotlib, Seaborn, Scikit-learn

**Data Understanding:**

In this step, you explore the data you have collected to gain insights into its quality, completeness, and relevance to the problem you want to solve.

Example tools: pandas, NumPy, Matplotlib, Seaborn, Scikit-learn

**Data Preparation:**

In this step, you clean and preprocess the data to prepare it for modeling. This may involve handling missing values (removing or imputing them), scaling and normalizing the data, and selecting features.

Example tools: pandas, NumPy, Scikit-learn
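As a minimal sketch of two common preparation tasks, the snippet below fills missing values with the column median and then applies min-max scaling using only pandas. The `toy` frame and its column names are made up for illustration and are not part of the ride-sharing dataset; scikit-learn offers comparable operations through `SimpleImputer` and `MinMaxScaler`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing value per column
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "income": [40_000.0, np.nan, 52_000.0, 61_000.0],
})

# Fill each missing value with the median of its column
filled = toy.fillna(toy.median())

# Min-max scaling: rescale every column to the [0, 1] range
scaled = (filled - filled.min()) / (filled.max() - filled.min())

print(scaled)
```

Whether to drop or fill missing values depends on how much data is missing and why; the median is only one possible choice.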

**Modeling:**

In this step, you create predictive models using machine learning algorithms. You may also use techniques such as clustering and association analysis to gain insights from the data.

Example tools: Scikit-learn, TensorFlow, Keras, PyTorch

**Evaluation:**

In this step, you evaluate the models you have created to determine how effectively they solve the problem defined in the Business Understanding step.

Example tools: Scikit-learn, TensorFlow, Keras, PyTorch
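The Modeling and Evaluation steps can be sketched together with scikit-learn. This example assumes a synthetic classification problem generated by `make_classification`; the data, the choice of logistic regression, and the accuracy metric are illustrative assumptions, not something derived from the ride-sharing dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 25% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Modeling: fit a simple classifier on the training split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluation: compare predictions with the held-out labels
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.3f}")
```

Scoring on a held-out split rather than on the training data gives a more honest estimate of how the model would behave after deployment.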

**Deployment:**

In this step, you deploy the models you have created to a production environment.

Example tools: Flask, Django

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
#Load the Dataset
data = pd.read_csv("../dataset/ride_sharing_new.csv")

In [3]:
#Check the first five rows
data.head()

|  | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 minutes | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | 2 | 1959 | Male |
| 1 | 24 minutes | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | 2 | 1965 | Male |
| 2 | 8 minutes | 67 | San Francisco Caltrain Station 2 (Townsend St... | 23 | The Embarcadero at Steuart St | 3652 | 3 | 1993 | Male |
| 3 | 4 minutes | 16 | Steuart St at Market St | 28 | The Embarcadero at Bryant St | 1883 | 1 | 1979 | Male |
| 4 | 11 minutes | 22 | Howard St at Beale St | 350 | 8th St at Brannan St | 4626 | 2 | 1994 | Male |


In [4]:
#Check the last five rows
data.tail()

|  | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|
| 25755 | 11 minutes | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 34 | Father Alfred E Boeddeker Park | 5063 | 1 | 2000 | Male |
| 25756 | 10 minutes | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 34 | Father Alfred E Boeddeker Park | 5411 | 2 | 1998 | Male |
| 25757 | 14 minutes | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 42 | San Francisco City Hall (Polk St at Grove St) | 5157 | 2 | 1995 | Male |
| 25758 | 14 minutes | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 42 | San Francisco City Hall (Polk St at Grove St) | 4438 | 2 | 1995 | Male |
| 25759 | 29 minutes | 16 | Steuart St at Market St | 115 | Jackson Playground | 1705 | 3 | 1990 | Male |


In [5]:
#Check the shape of the dataset
data.shape


(25760, 9)

The dataset has 25,760 rows and 9 columns.

In [6]:
#Columns of the dataset
data.columns

Index(['duration', 'station_A_id', 'station_A_name', 'station_B_id',
       'station_B_name', 'bike_id', 'user_type', 'user_birth_year',
       'user_gender'],
      dtype='object')

In [7]:
#Brief info on the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   duration         25760 non-null  object
 1   station_A_id     25760 non-null  int64 
 2   station_A_name   25760 non-null  object
 3   station_B_id     25760 non-null  int64 
 4   station_B_name   25760 non-null  object
 5   bike_id          25760 non-null  int64 
 6   user_type        25760 non-null  int64 
 7   user_birth_year  25760 non-null  int64 
 8   user_gender      25760 non-null  object
dtypes: int64(5), object(4)
memory usage: 1.8+ MB


In [4]:
#Summary Statistics on the dataset
data.describe()

|  | station_A_id | station_B_id | bike_id | user_type | user_birth_year |
|---|---|---|---|---|---|
| count | 25760.0 | 25760.0 | 25760.0 | 25760.0 | 25760.0 |
| mean | 31.023602 | 89.558579 | 4107.621467 | 2.008385 | 1983.054969 |
| std | 26.409263 | 105.144103 | 1576.315767 | 0.704541 | 10.010992 |
| min | 3.0 | 3.0 | 11.0 | 1.0 | 1901.0 |
| 25% | 15.0 | 21.0 | 3106.0 | 2.0 | 1978.0 |
| 50% | 21.0 | 58.0 | 4821.0 | 2.0 | 1985.0 |
| 75% | 67.0 | 93.0 | 5257.0 | 3.0 | 1990.0 |
| max | 81.0 | 383.0 | 6638.0 | 3.0 | 2001.0 |


In [9]:
# Boolean mask: True for each row that repeats an earlier row
data.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
25755    False
25756    False
25757    False
25758    False
25759    False
Length: 25760, dtype: bool

In [8]:
#Check for duplicate entries
data[data.duplicated()]

|  | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|
| 604 | 9 minutes | 81 | Berry St at 4th St | 81 | Berry St at 4th St | 1225 | 2 | 1993 | Male |
| 15217 | 17 minutes | 22 | Howard St at Beale St | 102 | Irwin St at 8th St | 492 | 3 | 1961 | Female |
| 18303 | 10 minutes | 30 | San Francisco Caltrain (Townsend St at 4th St) | 6 | The Embarcadero at Sansome St | 4442 | 1 | 1967 | Male |
| 20170 | 4 minutes | 21 | Montgomery St BART Station (Market St at 2nd St) | 343 | Bryant St at 2nd St | 5034 | 2 | 1993 | Male |


In [9]:
# Inspect every ride of bike 1225: rows 566 and 604 are identical
data[data["bike_id"]==1225]

|  | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|
| 505 | 7 minutes | 81 | Berry St at 4th St | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 1225 | 3 | 1979 | Male |
| 566 | 9 minutes | 81 | Berry St at 4th St | 81 | Berry St at 4th St | 1225 | 2 | 1993 | Male |
| 604 | 9 minutes | 81 | Berry St at 4th St | 81 | Berry St at 4th St | 1225 | 2 | 1993 | Male |
| 4105 | 11 minutes | 3 | Powell St BART Station (Market St at 4th St) | 350 | 8th St at Brannan St | 1225 | 3 | 1995 | Male |
| 8691 | 29 minutes | 67 | San Francisco Caltrain Station 2 (Townsend St... | 125 | 20th St at Bryant St | 1225 | 2 | 1981 | Male |
| 14615 | 9 minutes | 67 | San Francisco Caltrain Station 2 (Townsend St... | 88 | 11th St at Bryant St | 1225 | 2 | 1989 | Female |
| 19063 | 3 minutes | 22 | Howard St at Beale St | 17 | Embarcadero BART Station (Beale St at Market St) | 1225 | 3 | 1982 | Male |
| 20960 | 2 minutes | 5 | Powell St BART Station (Market St at 5th St) | 321 | 5th St at Folsom | 1225 | 2 | 1988 | Male |


In [10]:
# Count the duplicate entries
data.duplicated().sum()

4

In [11]:
#Drop duplicates
data.drop_duplicates(inplace=True)

In [12]:
#check the shape of the dataset
data.shape

(25756, 9)
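The duplicate handling above can be reproduced on a small frame. This is a sketch on made-up rows (the `rides` frame is hypothetical); it shows how the `keep` parameter of `duplicated` changes which rows are flagged, which is handy when you want to see both copies of a duplicate without filtering on a specific `bike_id`.

```python
import pandas as pd

# Toy rides (hypothetical): rows 1 and 2 are identical
rides = pd.DataFrame({
    "duration": ["7 minutes", "9 minutes", "9 minutes", "11 minutes"],
    "bike_id": [1225, 1225, 1225, 4442],
    "user_birth_year": [1979, 1993, 1993, 1967],
})

# Default keep="first": only the later copy is flagged
later_copies = rides[rides.duplicated()]

# keep=False flags every member of a duplicate group, so both copies appear
all_copies = rides[rides.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group
deduped = rides.drop_duplicates()

print(all_copies)
```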

In [13]:
# Data type of each column
data.dtypes

duration           object
station_A_id        int64
station_A_name     object
station_B_id        int64
station_B_name     object
bike_id             int64
user_type           int64
user_birth_year     int64
user_gender        object
dtype: object

### Highlights for Data Cleaning
- **Duration should be converted to an integer data type.**

    - Strip the " minutes" suffix from each duration.
    - Convert the resulting string to an integer.
    - Rename the column from duration to duration_mins.

- **User Type and User Gender should be converted to categorical.**
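The cleaning steps listed above can also be collected into one helper. This is a minimal sketch, assuming every duration has the form `"<number> minutes"`; the function name `clean_duration` and the `toy` frame are hypothetical and only serve the illustration.

```python
import pandas as pd


def clean_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'duration' parsed to integer minutes and renamed."""
    out = df.copy()
    # Remove the literal " minutes" suffix, e.g. "12 minutes" -> "12"
    minutes = out["duration"].str.replace(" minutes", "", regex=False)
    # Convert the remaining digits to integers
    out["duration"] = minutes.astype("int64")
    # Make the unit explicit in the column name
    return out.rename(columns={"duration": "duration_mins"})


toy = pd.DataFrame({"duration": ["12 minutes", "8 minutes"], "user_type": [2, 3]})
cleaned = clean_duration(toy)
cleaned["user_type"] = cleaned["user_type"].astype("category")

print(cleaned.dtypes)
```

Working on a copy leaves the input frame untouched, which makes the helper safe to re-run in a notebook.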


### Converting Duration to an integer

In [16]:
data["duration"]

0        12 minutes
1        24 minutes
2         8 minutes
3         4 minutes
4        11 minutes
            ...    
25755    11 minutes
25756    10 minutes
25757    14 minutes
25758    14 minutes
25759    29 minutes
Name: duration, Length: 25756, dtype: object

In [14]:
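# Strip the " minutes" suffix. Note: str.strip removes the given characters
# (not the literal substring) from both ends; digits are unaffected.
data["duration"] = data["duration"].str.strip(" minutes")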

In [15]:
data["duration"]

0        12
1        24
2         8
3         4
4        11
         ..
25755    11
25756    10
25757    14
25758    14
25759    29
Name: duration, Length: 25756, dtype: object
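One caveat when removing a unit with `str.strip`: its argument is treated as a set of characters, not as a literal suffix. That is harmless for values shaped like `"12 minutes"`, but it can silently accept malformed entries. The sketch below uses a made-up malformed value to show the difference, and a stricter alternative based on `str.extract`; the regular expression is an assumption about the expected format.

```python
import pandas as pd

# The last entry is a made-up malformed value
raw = pd.Series(["12 minutes", "8 minutes", "minutes 30 units"])

# str.strip(" minutes") removes any of the characters ' ', m, i, n, u, t, e, s
# from both ends, so the malformed entry is silently reduced to "30"
stripped = raw.str.strip(" minutes")

# Stricter alternative: accept only "<digits> minutes", yield NaN otherwise
strict = raw.str.extract(r"^(\d+) minutes$", expand=False)

print(stripped.tolist())
print(strict.isna().tolist())
```

Entries that come back as missing from the strict version can then be inspected before converting the column to integers.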

In [16]:
data["duration"].dtype

dtype('O')

In [17]:
# Convert the duration from object (string) to an integer data type
data["duration"] = data["duration"].astype("int")

In [18]:
data["duration"].dtype

dtype('int32')
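The width of the integer produced by `astype("int")` follows the platform default. An `int32` result is typical of Windows with NumPy versions before 2.0, while other setups usually give `int64`. If the width matters, it can be pinned explicitly, as in this small sketch on made-up values:

```python
import numpy as np
import pandas as pd

durations = pd.Series(["12", "24", "8"])

# "int64" fixes the width explicitly, independent of the platform default
as_int64 = durations.astype("int64")

print(as_int64.dtype)
```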

In [21]:
data.dtypes

duration            int32
station_A_id        int64
station_A_name     object
station_B_id        int64
station_B_name     object
bike_id             int64
user_type           int64
user_birth_year     int64
user_gender        object
dtype: object

In [22]:
data.head()

|  | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | 2 | 1959 | Male |
| 1 | 24 | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | 2 | 1965 | Male |
| 2 | 8 | 67 | San Francisco Caltrain Station 2 (Townsend St... | 23 | The Embarcadero at Steuart St | 3652 | 3 | 1993 | Male |
| 3 | 4 | 16 | Steuart St at Market St | 28 | The Embarcadero at Bryant St | 1883 | 1 | 1979 | Male |
| 4 | 11 | 22 | Howard St at Beale St | 350 | 8th St at Brannan St | 4626 | 2 | 1994 | Male |


In [23]:
# Rename the duration column to duration_mins
data.rename(columns={"duration": "duration_mins"}, inplace=True)

In [24]:
data.head()

|  | duration_mins | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | 2 | 1959 | Male |
| 1 | 24 | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | 2 | 1965 | Male |
| 2 | 8 | 67 | San Francisco Caltrain Station 2 (Townsend St... | 23 | The Embarcadero at Steuart St | 3652 | 3 | 1993 | Male |
| 3 | 4 | 16 | Steuart St at Market St | 28 | The Embarcadero at Bryant St | 1883 | 1 | 1979 | Male |
| 4 | 11 | 22 | Howard St at Beale St | 350 | 8th St at Brannan St | 4626 | 2 | 1994 | Male |


In [25]:
data.dtypes

duration_mins       int32
station_A_id        int64
station_A_name     object
station_B_id        int64
station_B_name     object
bike_id             int64
user_type           int64
user_birth_year     int64
user_gender        object
dtype: object

In [30]:
# Convert the user_type and user_gender columns to categories
data["user_type"] = data["user_type"].astype("category")
data["user_gender"] = data["user_gender"].astype("category")

In [31]:
data.dtypes

duration_mins         int32
station_A_id          int64
station_A_name       object
station_B_id          int64
station_B_name       object
bike_id               int64
user_type          category
user_birth_year       int64
user_gender        category
dtype: object
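As a sketch of what the categorical conversion does, the example below converts two columns in one call and inspects the resulting levels and integer codes. The `toy` frame is made up; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "user_type": [2, 3, 1, 2, 2],
    "user_gender": ["Male", "Female", "Male", "Other", "Male"],
})

# Convert both columns at once by passing a column -> dtype mapping
toy = toy.astype({"user_type": "category", "user_gender": "category"})

# Each distinct value becomes a category level (sorted by default)
type_levels = toy["user_type"].cat.categories.tolist()

# Internally each row stores a small integer code pointing at its level
gender_codes = toy["user_gender"].cat.codes.tolist()

print(type_levels)
print(gender_codes)
```

Storing codes instead of repeated strings is why categorical columns usually take less memory when a column has few distinct values.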

In [33]:
#Unique values of user_birth_year
data.user_birth_year.unique()

array([1959, 1965, 1993, 1979, 1994, 1981, 1991, 1982, 1983, 1990, 1998,
       1987, 1997, 1995, 1964, 2000, 1986, 1992, 1980, 1977, 1988, 1989,
       1966, 1967, 1996, 1973, 1999, 1985, 1975, 1961, 1970, 1958, 1969,
       1962, 1978, 1984, 1948, 1976, 1960, 1972, 1974, 1968, 1957, 1963,
       1971, 1951, 1954, 1952, 1953, 1943, 1945, 1927, 1955, 1902, 1956,
       1950, 1942, 1901, 1947, 1936, 2001, 1939, 1949], dtype=int64)
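The list of unique birth years includes 1901 and 1902, which look implausible for active riders. A minimal sketch for flagging such values follows; the cutoff `MIN_PLAUSIBLE_YEAR` is a hypothetical threshold chosen for illustration, not something defined by the dataset.

```python
import pandas as pd

# A few of the birth years that appear in the output above
years = pd.Series([1959, 1993, 1901, 1902, 2001, 1927])

MIN_PLAUSIBLE_YEAR = 1920  # hypothetical cutoff, adjust to the use case

# Select the values that fall below the cutoff for manual review
suspicious = years[years < MIN_PLAUSIBLE_YEAR]

print(suspicious.tolist())
```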

In [34]:
# Count of rides per gender
data["user_gender"].value_counts()

Male      19379
Female     6026
Other       351
Name: user_gender, dtype: int64

In [35]:
# Saving to a csv
data.to_csv("../dataset/ride_sharing_cleaned.csv", index=False)
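Note that a CSV file stores only text, so the categorical dtypes set earlier are not preserved when the file is read back. The sketch below shows the round trip on a made-up frame, using an in-memory string instead of a file on disk, and restores the dtype through the `dtype` argument of `read_csv`.

```python
import io

import pandas as pd

# Toy frame (hypothetical values) with one categorical column
toy = pd.DataFrame({
    "duration_mins": [12, 24],
    "user_gender": pd.Categorical(["Male", "Female"]),
})

# to_csv without a path returns the CSV text as a string
csv_text = toy.to_csv(index=False)

# Without hints, the column comes back as plain strings
plain = pd.read_csv(io.StringIO(csv_text))

# Passing dtype restores the categorical type on load
restored = pd.read_csv(io.StringIO(csv_text), dtype={"user_gender": "category"})

print(plain.dtypes)
print(restored.dtypes)
```

When a categorical column holds numbers, such as `user_type`, `read_csv` parses the category labels as strings, so a conversion after loading may be the simpler route there.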