## Merging two dataframes

Here we do a simple merge of two dataframes containing data on tube stations. The two source files look like this:

`stations.csv`
```
Latitude,Longitude,Station Name
51.5028,-0.2801,"Acton Town"
51.5143,-0.0755,"Aldgate"
51.5154,-0.0726,"Aldgate East"
51.5107,-0.013,"All Saints"
51.5407,-0.2997,"Alperton"
51.5322,-0.1058,"Angel"
```

`Station Stats.csv`

```
Year,Station,Entry-Weekday,Entry-Sat,Entry-Sun,Exit-Weekday,Exit-Sat,Exit-Sun,Entry & Exit-Annual
2007,Acton Town                         ,9205,6722,4427,8899,6320,4304,359.73
2007,Aldgate                            ,9887,2191,1484,10397,2587,1772,302.88
2007,Aldgate East                       ,12820,7040,5505,12271,6220,5000,458.48
2007,Alperton                           ,4611,3354,2433,4719,3450,2503,188.00
```

In this notebook we shall:

1. Import the two `.csv` files
2. Clean up the extra spaces in `Station Stats.csv`
3. Merge the dataframes
4. Upload the merged dataframe to Count

### Imports and definitions

In [36]:
import pandas as pd
from count_api import CountAPI

# Set this to the local path of the GitHub repository
data_dir = '/path/to/open-data-scripts/december-2018-hackathon/'
# Set this to your Count API access token
token = ''

### 1. Importing the data

In [4]:
station_locations = pd.read_csv(data_dir + 'Transportation/TubeStations/stations.csv')
station_entry_stats = pd.read_csv(data_dir + 'Transportation/TubeStations/Station Stats.csv')

In [8]:
print(station_locations.head(5))

   Latitude  Longitude  Station Name
0   51.5028    -0.2801    Acton Town
1   51.5143    -0.0755       Aldgate
2   51.5154    -0.0726  Aldgate East
3   51.5107    -0.0130    All Saints
4   51.5407    -0.2997      Alperton


In [9]:
print(station_entry_stats.head(5))

   Year                              Station  Entry-Weekday  Entry-Sat  \
0  2007  Acton Town                                    9205       6722   
1  2007  Aldgate                                       9887       2191   
2  2007  Aldgate East                                 12820       7040   
3  2007  Alperton                                      4611       3354   
4  2007  Amersham                                      4182       1709   

   Entry-Sun  Exit-Weekday  Exit-Sat  Exit-Sun  Entry & Exit-Annual  
0       4427        8899.0      6320      4304               359.73  
1       1484       10397.0      2587      1772               302.88  
2       5505       12271.0      6220      5000               458.48  
3       2433        4719.0      3450      2503               188.00  
4       1004        3938.0      1585       957               134.14  
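Printed output makes trailing whitespace hard to see. The sketch below is one way to check for it explicitly. It parses a few rows copied from the `Station Stats.csv` sample above rather than the real file; the `sample`, `stats` and `padded` names are illustrative only.

```python
import io

import pandas as pd

# A few rows copied from the sample of `Station Stats.csv` shown earlier
# (illustrative data, trimmed to three columns).
sample = (
    "Year,Station,Entry-Weekday\n"
    "2007,Acton Town                         ,9205\n"
    "2007,Aldgate                            ,9887\n"
)
stats = pd.read_csv(io.StringIO(sample))

# repr() puts quotes around each value, so the padding becomes visible.
print(stats['Station'].map(repr).tolist())

# Flag rows whose length changes once surrounding whitespace is removed.
padded = stats['Station'].str.len() != stats['Station'].str.strip().str.len()
print(padded.sum(), 'of', len(stats), 'station names are padded')
```

A non-zero count here is a sign that the column needs cleaning before it is used as a merge key.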


### 2. Cleaning the data

You'll note above that the `Station` column in `station_entry_stats` is padded with whitespace on the right-hand side. We'll compare station names in the next step, and the padding would stop otherwise identical names from matching. Here we use Python's built-in string method `str.strip()` to remove it.

We use the `apply(func)` method on a dataframe column, where `func` is a function applied to each element in the column. When `func` is very simple (a single expression), we can use an inline [lambda](https://docs.python.org/3.5/tutorial/controlflow.html#lambda-expressions) function to save precious keystrokes:

In [12]:
station_entry_stats['Station'] = station_entry_stats['Station'].apply(lambda x: x.strip())
print(station_entry_stats.head(5))

   Year       Station  Entry-Weekday  Entry-Sat  Entry-Sun  Exit-Weekday  \
0  2007    Acton Town           9205       6722       4427        8899.0   
1  2007       Aldgate           9887       2191       1484       10397.0   
2  2007  Aldgate East          12820       7040       5505       12271.0   
3  2007      Alperton           4611       3354       2433        4719.0   
4  2007      Amersham           4182       1709       1004        3938.0   

   Exit-Sat  Exit-Sun  Entry & Exit-Annual  
0      6320      4304               359.73  
1      2587      1772               302.88  
2      6220      5000               458.48  
3      3450      2503               188.00  
4      1585       957               134.14  
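As an alternative to `apply` with a lambda, pandas offers vectorised string methods through the `.str` accessor. The sketch below shows the same cleanup written that way, using a small hand-made dataframe (the `stats` name and its rows are illustrative, copied from the sample above) instead of the real file.

```python
import pandas as pd

# Illustrative rows copied from the sample of `Station Stats.csv` shown earlier.
stats = pd.DataFrame({
    'Year': [2007, 2007],
    'Station': ['Acton Town                         ',
                'Aldgate                            '],
    'Entry-Weekday': [9205, 9887],
})

# Series.str.strip() removes leading and trailing whitespace from every element.
stats['Station'] = stats['Station'].str.strip()

print(stats['Station'].tolist())
```

One practical difference: `.str.strip()` passes missing values through as `NaN`, whereas `lambda x: x.strip()` raises an error when it meets a non-string value.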


### 3. Merging the dataframes

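Now that the station names are formatted identically in both dataframes, we are ready to perform the merge. In the following line, we pass the two dataframes (`left` and `right`) and the columns to merge on (`left_on` and `right_on`), then drop the column that duplicates information (`Station Name`). Because `pd.merge` performs an inner join by default, only stations present in both dataframes are kept, and each station's coordinates are repeated for every year of statistics.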

Finally, we save the dataframe to a CSV file, `merged.csv`. Setting `index=False` means the row index is not written to the file.

In [16]:
station_merged = pd.merge(left=station_locations, right=station_entry_stats, left_on='Station Name', right_on='Station').drop('Station Name', axis=1)
print(station_merged.sort_values(by='Station').head(5))

   Latitude  Longitude  Year     Station  Entry-Weekday  Entry-Sat  Entry-Sun  \
0   51.5028    -0.2801  2007  Acton Town           9205       6722       4427   
1   51.5028    -0.2801  2008  Acton Town           9285       6574       4358   
2   51.5028    -0.2801  2009  Acton Town           8601       5816       4231   
3   51.5028    -0.2801  2010  Acton Town           8669       5912       4184   
4   51.5028    -0.2801  2011  Acton Town           8702       6326       4216   

   Exit-Weekday  Exit-Sat  Exit-Sun  Entry & Exit-Annual  
0        8899.0      6320      4304               359.73  
1        9028.0      6295      4361               361.05  
2        8595.0      5803      4324               337.69  
3        8403.0      5877      4224               336.58  
4        8392.0      5976      4223               340.57  
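It can be useful to check which stations fail to match before relying on the merged result. The sketch below is one way to do that, assuming small hand-made dataframes built from the samples above (the `locations`, `stats`, `check` and `unmatched` names are illustrative). It uses the `how` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Illustrative rows copied from the two samples shown earlier.
locations = pd.DataFrame({
    'Latitude': [51.5028, 51.5143, 51.5107],
    'Longitude': [-0.2801, -0.0755, -0.013],
    'Station Name': ['Acton Town', 'Aldgate', 'All Saints'],
})
stats = pd.DataFrame({
    'Year': [2007, 2007],
    'Station': ['Acton Town', 'Aldgate'],
    'Entry-Weekday': [9205, 9887],
})

# how='outer' keeps rows from both sides, matched or not;
# indicator=True adds a `_merge` column saying where each row came from.
check = pd.merge(left=locations, right=stats,
                 left_on='Station Name', right_on='Station',
                 how='outer', indicator=True)

# Rows that an inner join (the default) would silently drop.
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Station Name', 'Station', '_merge']])
```

In this toy data, "All Saints" has a location but no statistics, so it is flagged as `left_only`. An empty `unmatched` frame would mean every station name found a partner.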


In [17]:
station_merged.to_csv('merged.csv', index=False)
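To see what `index=False` changes, the sketch below writes a tiny illustrative dataframe (`df` is a made-up name, not the merged data) to CSV text both ways and prints the header line of each. Calling `to_csv` without a path returns the CSV as a string.

```python
import pandas as pd

# A tiny illustrative dataframe, not the real merged data.
df = pd.DataFrame({'Station': ['Acton Town', 'Aldgate'], 'Year': [2007, 2007]})

with_index = df.to_csv()                # row index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the dataframe's own columns

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

The first header begins with an empty field for the index column; the second contains only `Station` and `Year`.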

### 4. Uploading to Count

The fun bit! Initialise the Count API client (imported at the top of the notebook) with your access token, then upload the file saved in step 3:

In [15]:
count = CountAPI()
count.set_api_token(token)
table = count.upload(path='./merged.csv')

Finally, with the table uploaded, create an interactive plot of stations by location, coloured by the number of passengers entering each one during weekdays. Click on the Count logo to explore the data further, or modify the plot.

In [35]:
visual = table.upload_visual(x=table['Longitude'], y=table['Latitude'], color=table['Entry-Weekday'], chart_options={'color_type': 'log'})
visual.embed()