
<img src="img/LinkedIN Header.jpg">

# Data Processing with `pandas`

## Part 1: Merging datasets

In [None]:
import pandas as pd

In [None]:
from dataframes import europe, americas, requirements, prices, currencies, exchange_rates, dublin

Here are some details of the outlets in a small coffee shop chain:

In [None]:
europe

In [None]:
americas

**Q1** Create a new DataFrame called `outlets` which contains all eight entries, and has a new row index with unique values (from 0 to 7), with the `europe` entries first. The `location_id` column can be left as it is (we will resolve the duplicate values later). 

*Hint: use [pd.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) with the `ignore_index=True` parameter*
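
As a rough illustration of the hint, the sketch below stacks two small tables and rebuilds the row index. The `north` and `south` DataFrames and their contents are made up for this example and are not the notebook's `europe` and `americas` tables.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's `europe` / `americas` tables
north = pd.DataFrame({"location_id": [1, 2], "city": ["Leeds", "York"]})
south = pd.DataFrame({"location_id": [1, 2], "city": ["Bath", "Hove"]})

# Stack the rows of `north` above the rows of `south`.
# ignore_index=True discards the original row labels and assigns 0..n-1.
combined = pd.concat([north, south], ignore_index=True)

print(combined)
```

Without `ignore_index=True` the row labels `0, 1` would appear twice, once from each input table.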

In [None]:
#add your code below


A new outlet will be opened in Dublin. A site is found and it has the following `requirements`:

In [None]:
requirements

There's a catalogue of `prices` as follows:

In [None]:
prices

**Q2** Merge the `requirements` table with `prices` on `name`, creating a new DataFrame called `purchases` which is the same as `requirements` but with a `price` column added, and another column `cost` which is equal to `price` * `quantity`:
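
One possible shape for this kind of task is sketched below on made-up data. The `needed` and `catalogue` tables are hypothetical, and the choice of `how="left"` is an assumption made here so that every row of the first table is kept.

```python
import pandas as pd

# Hypothetical toy data, standing in for a requirements list and a price list
needed = pd.DataFrame({"name": ["mug", "kettle"], "quantity": [10, 2]})
catalogue = pd.DataFrame({"name": ["kettle", "mug", "tray"],
                          "price": [25.0, 3.5, 8.0]})

# Look up the price of each needed item by matching on the shared `name` column.
# how="left" keeps every row of `needed` and ignores unused catalogue entries.
order = needed.merge(catalogue, on="name", how="left")

# Element-wise multiplication of two columns gives a new column
order["cost"] = order["price"] * order["quantity"]

print(order)
```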

In [None]:
#add your code below


The details for the Dublin branch are in the `dublin` DataFrame:

In [None]:
dublin

**Q3** Add the `dublin` DataFrame to the bottom of the `outlets` DataFrame, again updating the row index so that it has numbers `0` to `8`:

*Hint: use [pd.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) with the `ignore_index=True` parameter*

In [None]:
#add your code below


## Part 2: Data preparation and cleaning

Another DataFrame contains the `currency` for each country in which there is an outlet:

In [None]:
currencies

**Q4** Create a DataFrame called `outlets_detail`, which is the same as `outlets` but has an additional column `Currency`, with the given currency for each outlet. 

- Notice that in the `currencies` DataFrame, the column heading `country` is lower case so does not quite match `Country`, and that `currency` needs to be renamed to `Currency`

- Avoid modifying the original `outlets` DataFrame

*Hint: you may find the `.drop()` method with the `axis=1` parameter useful for dropping columns*
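
The sketch below shows two ways of joining tables whose key columns have different headings, using made-up `shops` and `codes` tables (both hypothetical). Neither route modifies the original tables, because `.rename()`, `.merge()` and `.drop()` all return new DataFrames by default.

```python
import pandas as pd

# Hypothetical toy data with mismatched key headings (`Country` vs `country`)
shops = pd.DataFrame({"City": ["Oslo", "Lima"], "Country": ["Norway", "Peru"]})
codes = pd.DataFrame({"country": ["Peru", "Norway"], "currency": ["PEN", "NOK"]})

# Route 1: rename the headings first, then merge on the shared name
codes_renamed = codes.rename(columns={"country": "Country",
                                      "currency": "Currency"})
shops_detail = shops.merge(codes_renamed, on="Country", how="left")

# Route 2: merge on differently named keys, then drop the duplicate key column
alt = (shops.merge(codes, left_on="Country", right_on="country", how="left")
            .drop("country", axis=1)
            .rename(columns={"currency": "Currency"}))

print(shops_detail)
```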

In [None]:
#add your code below


Run the following code to create lists of the countries where there are outlets in the two regions:

In [None]:
EUROPE = ['UK', 'Italy', 'France', 'Germany', 'Ireland']
AMERICAS = ['Argentina', 'Brazil', 'USA']

The function `region` takes a single argument, `country`, and returns 'Europe' if that country is in the `EUROPE` list, 'Americas' if it is in the `AMERICAS` list, and 'Other' if it is in neither list:

In [None]:
def region(country):
    
    if country in EUROPE:
        return 'Europe'
    elif country in AMERICAS:
        return 'Americas'
    else:
        return 'Other'


**Q5** Add a new column `Region` to the `outlets_detail` DataFrame, which uses the function `region` and `.apply()` to populate the column values:
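
To show the general pattern of applying a function to a column, here is a small sketch on made-up data. The `size_band` helper and the `cafes` table are hypothetical and unrelated to `region` or `outlets_detail`.

```python
import pandas as pd

def size_band(seats):
    """Hypothetical helper: classify an outlet by its number of seats."""
    if seats < 20:
        return "Small"
    elif seats < 50:
        return "Medium"
    else:
        return "Large"

# Hypothetical toy data
cafes = pd.DataFrame({"City": ["Oslo", "Lima", "Kyoto"], "Seats": [12, 48, 75]})

# Series.apply calls size_band once per value and returns a new Series,
# which is then stored as a new column
cafes["Size"] = cafes["Seats"].apply(size_band)

print(cafes)
```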

In [None]:
#add your code below


**Q6** Create a column `new_id` which contains strings in the format `<Region>_<location_id>`, for example:

`Europe_1`

*Hint: you may find the `.astype()` method useful*
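
A minimal sketch of why the conversion matters, using a made-up `items` table (hypothetical): a numeric column has to be converted to strings before it can be joined to text with `+`.

```python
import pandas as pd

# Hypothetical toy data: one text column and one integer column
items = pd.DataFrame({"category": ["Tea", "Coffee"], "item_no": [7, 12]})

# .astype(str) turns the integers into strings so that `+` concatenates
items["label"] = items["category"] + "_" + items["item_no"].astype(str)

print(items)
```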

In [None]:
#add your code below


**Q7** Finally, drop the original `location_id` column, and set the index of the DataFrame as the values in the `new_id` column, discarding the original index:

In [None]:
#add your code below


*Note that dropping a column and reassigning the result is destructive: running the cell a second time raises an error because the column can no longer be found. In these circumstances you may need to re-run earlier cells to restore the DataFrame to its previous state.*
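
The re-run behaviour noted above can be reproduced on a toy table. The `items` DataFrame below is made up for this sketch, which also shows `.set_index()` replacing the default row index with a column's values.

```python
import pandas as pd

# Hypothetical toy data
items = pd.DataFrame({
    "label": ["Tea_7", "Coffee_12"],
    "item_no": [7, 12],
    "price": [2.5, 3.0],
})

# First run: the column exists, so it is removed
items = items.drop("item_no", axis=1)

# Second run: the column has already gone, so pandas raises a KeyError
second_run_failed = False
try:
    items = items.drop("item_no", axis=1)
except KeyError:
    second_run_failed = True

# set_index moves `label` out of the columns and uses it as the row index,
# replacing the default 0..n-1 index
items = items.set_index("label")

print(items)
```

Passing `errors="ignore"` to `.drop()` is one way to make such a cell safe to re-run, if that suits the task.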

### Preparation of a different dataset 

In Part 3 we will be working with a different dataset. It will be possible to load the prepared dataset directly later in the notebook, but let's have a go at doing some of this preparation work ourselves first:

In [None]:
df = pd.read_csv('data/ward-profiles.csv')

In [None]:
df.head(3)

The dataset contains data for each ward in London. However, you'll notice that (with the exception of `City of London`) each `Ward name` value is prefixed with the name of the Borough in which the ward is located.

**Q8** Create a function `borough` which will identify the string ` - ` (a dash with a space on either side) within another string, and return the text which precedes it. If the string ` - ` is not present, the whole string should be returned.

For example:

`City of London` would return `City of London`  
`Barking and Dagenham - Abbey` would return `Barking and Dagenham`  

*Hint: you may find the `.split()` method with the `sep=' - '` parameter useful*
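
As a quick reminder of how `str.split` behaves, the sketch below uses made-up strings and a different separator (` / `). The key point is that a list is always returned, even when the separator is absent.

```python
# Hypothetical example strings, using " / " rather than the notebook's separator
with_sep = "Yorkshire / Leeds".split(sep=" / ")
without_sep = "Leeds".split(sep=" / ")

# Index 0 is the first piece and index -1 is the last piece.
# When the separator is absent, both refer to the whole original string.
first, last = with_sep[0], with_sep[-1]

print(with_sep, without_sep, first, last)
```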

In [None]:
#add your code below
#def borough(text):


Use `.apply()` and your function `borough` to create a column called `Borough` which contains the returned string:

**Q9** Following the same process as above, create a function `ward` which will identify the string ` - ` (a dash with a space on either side) within another string, and return the text following it. If the string ` - ` is not present, the whole string should be returned.

- `City of London` would return `City of London`
- `Barking and Dagenham - Abbey` would return `Abbey`



In [None]:
#add your code below
#def ward(text):


Use `.apply()` and your function to create a column called `Ward` which contains the returned string:

Use `.drop()` to get rid of the original `Ward name` column:

**Q10** Finally, move the new `Borough` and `Ward` columns to be the first two columns in the DataFrame:

*Hint: this [Stack Overflow answer](https://stackoverflow.com/questions/35321812/move-column-in-pandas-dataframe/35322540#35322540) may be useful for reference*
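
One common way to reorder columns is to build the new column order as a list and index the DataFrame with it. The sketch below does this on a made-up table (`df_toy` and its column names are hypothetical) and is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical toy data with the columns in their original order
df_toy = pd.DataFrame({"a": [1], "b": [2], "x": [3], "y": [4]})

# Columns to bring to the front, followed by everything else in existing order
front = ["x", "y"]
rest = [col for col in df_toy.columns if col not in front]

# Indexing with a list of names returns the columns in that order
df_toy = df_toy[front + rest]

print(df_toy)
```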

In [None]:
#add your code below


If you managed to do all of those tasks, your DataFrame should be the same as `wards` loaded at the beginning of Part 3 below.

## Part 3: Data grouping and aggregation

In [None]:
wards = pd.read_csv('data/ward-profiles-clean.csv')
wards.head(3)

**Q11** Use `.groupby()` to create a Series called `population` which contains the sum of the values in the `Population - 2015` column for each `Borough`:
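
The general group-then-sum pattern is sketched below on a made-up `sales` table (hypothetical). Selecting a single column after `.groupby()` is what makes the result a Series rather than a DataFrame.

```python
import pandas as pd

# Hypothetical toy data
sales = pd.DataFrame({"region": ["N", "S", "N", "S"], "units": [5, 3, 2, 6]})

# Group the rows by `region`, pick the `units` column, then sum within each group.
# The result is a Series indexed by the group labels.
units_by_region = sales.groupby("region")["units"].sum()

print(units_by_region)
```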

In [None]:
#add your code below


**Q12** Use `.groupby()` and `.agg()` to create a DataFrame called `cars_stats` which contains the `max`, `min` and `mean` of the `Cars per household - 2011` for each `Borough`: 

In [None]:
#add your code below


Update `cars_stats` so that `mean` is rounded to one decimal place:
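
The aggregation and rounding steps described above follow a pattern that can be sketched on made-up data. The `scores` table is hypothetical; passing a list of function names to `.agg()` produces one column per function.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({"team": ["A", "A", "A", "B", "B"],
                       "score": [1, 2, 2, 4, 5]})

# One row per team, one column per aggregation function
stats = scores.groupby("team")["score"].agg(["max", "min", "mean"])

# Overwrite a single column with its rounded values
stats["mean"] = stats["mean"].round(1)

print(stats)
```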

In [None]:
#add your code below


**Q13** Create a Series called `ward_count` which has an index of `Borough` and with values showing the `.count()` of `Ward` for each, i.e. the number of wards in each `Borough`. Order this by the values, with the `Borough` with the most wards at the top:
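
Counting rows per group and ordering the result can be chained in one expression, as in this sketch on a made-up `pets` table (hypothetical).

```python
import pandas as pd

# Hypothetical toy data
pets = pd.DataFrame({
    "owner": ["Ann", "Bob", "Ann", "Cy", "Ann", "Bob"],
    "pet": ["cat", "dog", "fish", "cat", "dog", "hen"],
})

# Count the non-null `pet` values per owner, then sort largest first
pet_count = (pets.groupby("owner")["pet"]
                 .count()
                 .sort_values(ascending=False))

print(pet_count)
```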

In [None]:
#add your code below


The following code creates a DataFrame called `transport`, which contains the columns `['Borough', 'Ward', 'Average Public Transport Accessibility score - 2014', '% travel by bicycle to work - 2011']` from `wards`:

In [None]:
transport = wards[['Borough', 
                   'Ward', 
                   'Average Public Transport Accessibility score - 2014', 
                   '% travel by bicycle to work - 2011']]


In [None]:
transport.head()

**Q14** Merge the columns from `cars_stats` into `transport`, such that:

- the number of rows in `transport` remains the same
- three new columns are added (`max`, `min`, `mean`)
- the values in each of these columns for all wards in a given `Borough` are the same

*Hint: you may find the `.merge()` method useful*
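
The sketch below shows a many-to-one merge on made-up data: per-group statistics, indexed by the group label, are attached to every matching detail row. The `detail` and `team_stats` tables are hypothetical, and calling `.reset_index()` before merging is just one option for turning the index into an ordinary key column.

```python
import pandas as pd

# Hypothetical toy data: several detail rows per team
detail = pd.DataFrame({"team": ["A", "A", "B"], "player": ["p1", "p2", "p3"]})

# Hypothetical per-team statistics, indexed by team (as a groupby would produce)
team_stats = pd.DataFrame({"max": [2, 5], "mean": [1.5, 4.5]},
                          index=pd.Index(["A", "B"], name="team"))

# reset_index turns the `team` index into a column so it can be used as the key.
# how="left" keeps every detail row, so the row count does not change.
merged = detail.merge(team_stats.reset_index(), on="team", how="left")

print(merged)
```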

In [None]:
#add your code below


**Q15** Drop the `max` and `min` columns from the `transport` DataFrame:


In [None]:
#add your code below


**Q16** Finally, rename the `mean` column to `Borough household cars - average` in the `transport` DataFrame:

In [None]:
#add your code below


