# **DATA CLEANING**
## *Marc Fuentes and Víctor Novelle*

*November 2020*


The first step of the data cleaning process, which turns our raw data into clean information, is to prepare a comfortable working environment.

In [None]:
# Necessary libraries for code execution.
import pandas as pd
import altair as alt

In [None]:
# Mount Google Drive so the data files can be read from and written to it.
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


Once this step was done, we proceeded to load our raw data and make a general analysis to understand the different attributes and values present in it.

We selected October 2019 as the month to study, mainly for two reasons:

* The COVID-19 pandemic had not started yet, so our data is not biased by this phenomenon.
* We wanted to analyze the daily activity of a station near the **FIB**. By October, all the students have already started their respective courses, so we should not observe large changes caused by the start or end of holidays (as in September or December).


In [None]:
# Loading of the selected data set.
data = pd.read_csv('/content/gdrive/My Drive/GCED/Q5/VI/Projecte 1/octubre19.csv')

In [None]:
# Observation of the different attributes present in the data set.
print(data.columns)

Index(['station_id', 'num_bikes_available',
       'num_bikes_available_types.mechanical',
       'num_bikes_available_types.ebike', 'num_docks_available',
       'is_installed', 'is_renting', 'is_returning', 'last_reported',
       'is_charging_station', 'status', 'last_updated', 'ttl'],
      dtype='object')


In [None]:
# Observation of the first values of our data set.
data.head()

Unnamed: 0,station_id,num_bikes_available,num_bikes_available_types.mechanical,num_bikes_available_types.ebike,num_docks_available,is_installed,is_renting,is_returning,last_reported,is_charging_station,status,last_updated,ttl
0,1,25,24,1,4,1,1,1,1569881000.0,True,IN_SERVICE,1569880820,30
1,2,12,11,1,15,1,1,1,1569881000.0,True,IN_SERVICE,1569880820,30
2,3,19,19,0,8,1,1,1,1569881000.0,True,IN_SERVICE,1569880820,30
3,4,15,15,0,3,1,1,1,1569881000.0,True,IN_SERVICE,1569880820,30
4,5,35,35,0,3,1,1,1,1569881000.0,True,IN_SERVICE,1569880820,30
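Beyond the column names and the first rows, a general analysis usually also looks at column types, missing values and the number of distinct values per column. The following is a minimal sketch of such an inspection, not part of the original notebook; `sample` is a hypothetical three-row frame built from a few of the columns and values printed above.

```python
import pandas as pd

# Hypothetical miniature of the raw data, using values from the first rows shown above.
sample = pd.DataFrame({
    'station_id': [1, 2, 3],
    'num_bikes_available': [25, 12, 19],
    'num_docks_available': [4, 15, 8],
    'status': ['IN_SERVICE'] * 3,
    'last_updated': [1569880820] * 3,
})

# Data type of every column.
print(sample.dtypes)

# Number of missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of distinct values per column (constant columns show 1).
distinct_per_column = sample.nunique()
print(distinct_per_column)
```

Columns with a single distinct value, such as `status` in this sample, are natural candidates for removal later on.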


Once we had acquired a proper understanding of the dataset, we proceeded to clean it.
The first step consisted of applying a filter to select the stations we were interested in analyzing. We selected the following stations:

*   **302**: Station located near the **FIB**. As mentioned previously, we expected a station near a university to be interesting to analyze, since different patterns may be observed depending on the students' schedules.
*   **98**: Station located near **Sants-Estació**. We selected this station because it is one of the biggest public transport hubs in Barcelona and has subway, bus and train connections.
*   **334**: Station located in the **Sarrià** neighborhood. This location was selected to see whether a wealthier neighborhood behaves differently from the rest of the selected stations. We were also interested in how its location affects daily use, since it lies in a high-altitude area of Barcelona.

In [None]:
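# Selection of the previously named stations.
# .copy() gives an independent frame, so later column assignments do not raise SettingWithCopyWarning.
data_selected = data[data['station_id'].isin([302, 98, 334])].copy()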

The next step after selecting the stations was to convert the different temporal variables, stored as Unix timestamps in seconds, to a more manageable format (YYYY-MM-DD hh:mm:ss).

In [None]:
# Changing the temporal format.
data_selected['last_reported'] = pd.to_datetime(data_selected['last_reported'],unit = 's')
data_selected['last_updated'] = pd.to_datetime(data_selected['last_updated'],unit = 's')

In [None]:
# Checking that the transformation was correctly applied.
data_selected.head()

Unnamed: 0,station_id,num_bikes_available,num_bikes_available_types.mechanical,num_bikes_available_types.ebike,num_docks_available,is_installed,is_renting,is_returning,last_reported,is_charging_station,status,last_updated,ttl
90,98,0,0,0,21,1,1,1,2019-09-30 21:59:06,True,IN_SERVICE,2019-09-30 22:00:20,30
290,302,0,0,0,31,1,1,1,2019-09-30 21:58:24,True,IN_SERVICE,2019-09-30 22:00:20,30
321,334,0,0,0,24,1,1,1,2019-09-30 21:59:31,True,IN_SERVICE,2019-09-30 22:00:20,30
500,98,0,0,0,21,1,1,1,2019-09-30 22:03:28,True,IN_SERVICE,2019-09-30 22:05:11,21
700,302,0,0,0,31,1,1,1,2019-09-30 22:02:46,True,IN_SERVICE,2019-09-30 22:05:11,21
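Assuming the feed stores standard Unix epoch seconds, the timestamps obtained this way are expressed in UTC, which would explain why the first rows fall on 30 September. As an optional extra step that is not part of the original notebook, the sketch below converts the first `last_updated` value shown earlier to Barcelona local time; the `Europe/Madrid` time zone and the variable names are assumptions made for illustration.

```python
import pandas as pd

# First last_updated value of the raw data, in epoch seconds.
epoch_seconds = pd.Series([1569880820])

# utc=True yields timezone-aware timestamps instead of naive ones.
utc_times = pd.to_datetime(epoch_seconds, unit='s', utc=True)

# Convert to local time (assumed zone: Europe/Madrid, which covers Barcelona).
local_times = utc_times.dt.tz_convert('Europe/Madrid')

print(utc_times.iloc[0])
print(local_times.iloc[0])
```

Working in local time can matter when the visualizations group records by hour of the day, since the UTC hours are shifted with respect to the hours the students actually experience.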


The next step consisted of performing an integrity check. We verified whether any of the stations had a problem during the month, that is, whether it was uninstalled or was not renting or returning bikes at any time.

In [None]:
# Number of records in which a station was not installed.
sum(data_selected['is_installed'] == False)

0

In [None]:
# Number of records in which a station was not renting bikes.
sum(data_selected['is_renting'] == False)

0

In [None]:
# Number of records in which a station was not accepting returned bikes.
sum(data_selected['is_returning'] == False)

0

As the results show, this was not the case: all three stations were functioning properly. We also checked whether any of the selected stations did not accept electric bikes and whether all the stations were in service throughout the month:

In [None]:
# Number of records in which a station was not a charging station.
sum(data_selected['is_charging_station'] == False)

0

In [None]:
# Number of records in which a station was not in service.
sum(data_selected['status'] != "IN_SERVICE")

0
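The five checks above can also be gathered into a single summary. The following is a minimal sketch of that idea, not the code used in the notebook; `sample` is a hypothetical frame with the same flag columns, and `flag_columns` and `violations` are names introduced for illustration.

```python
import pandas as pd

# Hypothetical miniature of data_selected containing only the columns being checked.
sample = pd.DataFrame({
    'is_installed': [1, 1, 1],
    'is_renting': [1, 1, 1],
    'is_returning': [1, 1, 1],
    'is_charging_station': [True, True, True],
    'status': ['IN_SERVICE'] * 3,
})

# Columns that are expected to be truthy in every record.
flag_columns = ['is_installed', 'is_renting', 'is_returning', 'is_charging_station']

# Count, per column, the records that violate the expectation.
violations = {col: int((~sample[col].astype(bool)).sum()) for col in flag_columns}
violations['status'] = int((sample['status'] != 'IN_SERVICE').sum())

print(violations)
```

A dictionary of zeros corresponds to the situation described in the text, where every selected station behaved properly during the whole month.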

The next step consisted of removing the variables that will not be used in the visualizations:

* *is_installed, is_renting, is_returning*. As stated previously, all the stations behaved properly, so these variables are not needed.
*  *is_charging_station* and *status*. Same reason: they take a single value in every record.
*  *ttl*. The *TimeToLive* of the response, which is of no use for our visualizations.
*  *last_updated*. We discarded this variable because it provides approximately the same information as *last_reported*. *last_updated* is the timestamp of the file, while *last_reported* is the timestamp reported by the station itself; the latter is more precise, so we selected it as the variable to keep.
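To illustrate how close the two timestamps are, the sketch below computes the lag between them for the three rows printed after the format conversion. It is an illustrative check under the assumption that those rows are representative, not an analysis performed in the original notebook; `sample` and `lag` are hypothetical names.

```python
import pandas as pd

# The first record of each selected station, as printed after the format conversion.
sample = pd.DataFrame({
    'station_id': [98, 302, 334],
    'last_reported': pd.to_datetime([
        '2019-09-30 21:59:06',
        '2019-09-30 21:58:24',
        '2019-09-30 21:59:31',
    ]),
    'last_updated': pd.to_datetime(['2019-09-30 22:00:20'] * 3),
})

# Delay between the station report and the generation of the file.
lag = sample['last_updated'] - sample['last_reported']
print(lag.describe())
```

In these rows the file timestamp trails the station report by roughly one to two minutes, which is consistent with treating the two variables as approximately equivalent.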

In [None]:
# Removal of the variables that are not needed for the visualizations.
data_selected = data_selected.drop(columns=['is_installed', 'is_renting', 'is_returning', 'is_charging_station', 'status', 'ttl', 'last_updated'])

Lastly, we modified the *station_id* column, replacing the identifier number of each station with its geographic location. This makes both the dataset and the visualizations easier to read. To be coherent with this modification, we also renamed the variable to *station*.

In [None]:
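# Change of values and column names (station_id).
# Assigning the result back avoids relying on inplace=True over a column selection, which recent pandas versions warn about.
data_selected['station_id'] = data_selected['station_id'].replace({98: "Sants", 302: "FIB", 334: "Sarrià"})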
data_selected = data_selected.rename(columns = {'station_id':'station'})
# Checking that the modifications were correctly applied.
data_selected.head()

Unnamed: 0,station,num_bikes_available,num_bikes_available_types.mechanical,num_bikes_available_types.ebike,num_docks_available,last_reported
90,Sants,0,0,0,21,2019-09-30 21:59:06
290,FIB,0,0,0,31,2019-09-30 21:58:24
321,Sarrià,0,0,0,24,2019-09-30 21:59:31
500,Sants,0,0,0,21,2019-09-30 22:03:28
700,FIB,0,0,0,31,2019-09-30 22:02:46


Finally, we generated a CSV file from our clean dataset and stored it in the cloud via *Google Drive* so that it can be used when creating the visualizations.

In [None]:
# Save the clean data set without the index and copy it to the project folder in Google Drive.
data_selected.to_csv("CleanData.csv", index = False)
!cp CleanData.csv '/content/gdrive/My Drive/GCED/Q5/VI/Projecte 1'
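Since a CSV file stores timestamps as plain text, the datetime type of *last_reported* is lost on export and has to be restored when the file is read back. The following is a minimal sketch of how the clean file might be reloaded for the visualizations; it reads from an in-memory string (`csv_text`, a hypothetical excerpt with a subset of the columns) instead of the real `CleanData.csv`.

```python
import io

import pandas as pd

# Hypothetical excerpt of the clean file, using rows shown in the final table above.
csv_text = (
    "station,num_bikes_available,num_docks_available,last_reported\n"
    "Sants,0,21,2019-09-30 21:59:06\n"
    "FIB,0,31,2019-09-30 21:58:24\n"
)

# parse_dates restores the datetime type of the listed columns.
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=['last_reported'])
print(reloaded.dtypes)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path to `CleanData.csv`.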