**The following guidelines set expectations for participant behaviour during workshop activities. They also help ensure that the class environment is welcoming, inclusive, and respectful.**
- Where a discussion is taking place, allow everyone a chance to speak
- Listen respectfully, without interrupting and with an open mind to understanding others’ views
- Be professional and productive, and always share your ideas: your opinion matters, and we can all learn something from each other
- Personal information that comes up in the conversation should be kept confidential
- Avoid inflammatory language
- Avoid assumptions about any member of the class or generalisations about social groups

**You have now joined a group of fellow analysts in a workshop:**
#### Outcomes:

- To complete a group activity to identify how Data Processing with pandas could be useful for your job role
- To be able to answer the questions to test your knowledge of Data Processing with pandas


#### Note:
- We understand learners will progress at their own speed
- Tutors will be on hand to answer questions

## Group activity

- Identify how **Data Processing with pandas** could be useful for data analytics within your job role, and share your thoughts.

In [None]:
#add your notes below



# Data Processing with `pandas`

## Part 1: Combining datasets

- Make sure you run the following code cell before you attempt any of the questions
- In the following section, you will be analysing some data related to a small coffee shop chain

In [None]:
import pandas as pd

from dataframes import europe, americas, requirements, prices, currencies, exchange_rates, dublin

Here are some details of outlets in a small coffee shop chain:

In [None]:
europe

In [None]:
americas

**Q1)** Concatenate two pandas DataFrames

- For this task, use the `pd.concat()` function with the `ignore_index=True` parameter to combine the `europe` and `americas` DataFrames, with the `europe` entries first

- The `location_id` column can be left as it is (we will resolve the duplicate values later)

- Call the new DataFrame `outlets`. It should contain eight entries and have an index with unique values from 0 to 7

*To find out more, see: [pd.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)*

In [None]:
#add your code below

outlets = pd.concat([europe, americas], ignore_index=True)
outlets
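
The effect of `ignore_index=True` can be seen on a small example. The sketch below uses two made-up DataFrames (`left` and `right` are hypothetical names, not part of the workshop data) and compares the row index with and without the parameter:

```python
import pandas as pd

# Hypothetical toy data standing in for the workshop's `europe` and `americas` tables
left = pd.DataFrame({'location_id': [1, 2], 'City': ['London', 'Rome']})
right = pd.DataFrame({'location_id': [1, 2], 'City': ['Lima', 'Boston']})

# Without ignore_index, each input keeps its own row labels, so labels repeat
kept = pd.concat([left, right])
print(kept.index.tolist())   # [0, 1, 0, 1]

# With ignore_index=True, the result gets a fresh index running from 0 to n-1
fresh = pd.concat([left, right], ignore_index=True)
print(fresh.index.tolist())  # [0, 1, 2, 3]
```

Repeated row labels are legal in pandas, but they make label-based selection with `.loc` return several rows, which is usually not what is wanted here.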


A new outlet will be opened in Dublin. A site is found, and it has the following `requirements`:

In [None]:
requirements

There’s a catalogue of `prices` as follows:

In [None]:
prices

**Q2)** Join two pandas DataFrames

- For this task, use the `.merge()` method

- Merge the `requirements` table with `prices` on `name`, creating a new DataFrame called `purchases`. It should be the same as `requirements` but with a `price` column added, plus another column, `cost`, equal to `price` * `quantity`

*To find out more, see: [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)*

In [None]:
#add your code below

purchases = requirements.merge(prices, how='left', on='name')
purchases['cost'] = purchases['price'] * purchases['quantity']
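
With `how='left'`, any required item that is missing from the price catalogue is kept, but its `price` and `cost` come out as `NaN`. One way to spot such rows is the `indicator=True` parameter of `.merge()`. The sketch below uses invented tables (`needs` and `catalogue` are hypothetical names and values, not the workshop data):

```python
import pandas as pd

# Hypothetical toy tables; the item names and prices are invented for illustration
needs = pd.DataFrame({'name': ['table', 'chair', 'lamp'], 'quantity': [2, 8, 3]})
catalogue = pd.DataFrame({'name': ['table', 'chair'], 'price': [50.0, 20.0]})

# A left merge keeps every row of `needs`; indicator=True adds a `_merge` column
checked = needs.merge(catalogue, how='left', on='name', indicator=True)
checked['cost'] = checked['price'] * checked['quantity']

# Rows marked 'left_only' had no match in the catalogue, so price and cost are NaN
missing = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
print(missing)  # ['lamp']
```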


The details for the Dublin branch are in the `dublin` DataFrame:

In [None]:
dublin

**Q3)** Concatenate two pandas DataFrames
- Add the `dublin` DataFrame to the bottom of the `outlets` DataFrame using the `pd.concat()` function with the `ignore_index=True` parameter, so that the row index again runs from `0` to `8`
- Save the new DataFrame as `outlets_new`

In [None]:
#add your code below

outlets_new = pd.concat([outlets, dublin], ignore_index=True)
outlets_new


## Part 2: Data preparation and cleaning

- Make sure you run the following code cell before you attempt any of the questions
- This section continues the previous analysis activities. You will again be analysing data related to a small coffee shop chain

The `currencies` DataFrame contains the `currency` information for each country in which there is an outlet:

In [None]:
currencies

**Q4)** Add currency information to the outlets, without modifying the original `outlets_new` DataFrame.

- Use the `.copy()` method to create a new DataFrame called `outlets_detail`, which is the same as the `outlets_new` DataFrame
- Merge the `outlets_detail` DataFrame with the `currencies` DataFrame to get currency information for each outlet. Note that the column heading `country` in `currencies` is lower case, so it does not match the column heading `Country` in `outlets_detail`
- Use the `.drop()` method with the `axis=1` parameter to drop the `country` column
- Use the `.rename()` method to rename the `currency` column to `Currency`

In [None]:
#add your code below
#outlets_detail = outlets_new.copy()

outlets_detail = outlets_new.copy()
outlets_detail = outlets_detail.merge(currencies, how='left', left_on='Country', right_on='country')
outlets_detail.drop('country', axis=1, inplace=True)
outlets_detail.rename(columns={'currency': 'Currency'}, inplace=True)
outlets_detail
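
When the key columns have different names, `left_on` and `right_on` keep both of them in the result, which is why the extra `country` column has to be dropped. The following sketch illustrates this on made-up tables (`shops` and `money` are hypothetical names), and shows the non-in-place form of `.drop()` and `.rename()` as one possible alternative to `inplace=True`:

```python
import pandas as pd

# Hypothetical toy tables with differently cased key columns
shops = pd.DataFrame({'City': ['Paris', 'Dublin'], 'Country': ['France', 'Ireland']})
money = pd.DataFrame({'country': ['France', 'Ireland'], 'currency': ['EUR', 'EUR']})

# left_on/right_on keeps both key columns, so `country` duplicates `Country`
merged = shops.merge(money, how='left', left_on='Country', right_on='country')
print(merged.columns.tolist())  # ['City', 'Country', 'country', 'currency']

# Non-in-place drop and rename return new DataFrames, so they can be chained
tidy = merged.drop('country', axis=1).rename(columns={'currency': 'Currency'})
print(tidy.columns.tolist())    # ['City', 'Country', 'Currency']
```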


Run the following code to create lists of the countries where there are outlets in the two regions:

In [None]:
EUROPE = ['UK', 'Italy', 'France', 'Germany', 'Ireland']
AMERICAS = ['Argentina', 'Brazil', 'USA']

You have been given the below code for a function called `region`, which takes a single argument, `country`, and returns **'Europe'** if the country is in the `EUROPE` list, **'Americas'** if it is in the `AMERICAS` list, and **'Other'** if it is in neither list. Make sure you run the following code cell before you attempt any of the questions:

In [None]:
def region(country):
    
    if country in EUROPE:
        return 'Europe'
    elif country in AMERICAS:
        return 'Americas'
    else:
        return 'Other'


**Q5)** Add a new column `Region` to the `outlets_detail` DataFrame, using the `region` function and the `.apply()` method to populate the column values.

- The new values should be generated from the `Country` column in the `outlets_detail` DataFrame

In [None]:
#add your code below

outlets_detail['Region'] = outlets_detail['Country'].apply(region)
outlets_detail
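
For a simple lookup like this, `.apply()` with a function is not the only option. As a sketch of one alternative, the example below builds a dictionary from the two country lists and uses `Series.map()`. The `lookup` dictionary and the sample `countries` Series are assumptions made for illustration:

```python
import pandas as pd

# Country lists as defined in the workshop
EUROPE = ['UK', 'Italy', 'France', 'Germany', 'Ireland']
AMERICAS = ['Argentina', 'Brazil', 'USA']

# Build a country -> region lookup from the two lists
lookup = {c: 'Europe' for c in EUROPE}
lookup.update({c: 'Americas' for c in AMERICAS})

# Hypothetical sample; 'Japan' is included only to exercise the fallback
countries = pd.Series(['UK', 'Brazil', 'Japan'])

# .map() returns NaN for keys missing from the dict; fillna supplies 'Other'
regions = countries.map(lookup).fillna('Other')
print(regions.tolist())  # ['Europe', 'Americas', 'Other']
```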


You have been given the below code for creating a new column `new_id` which contains strings in the format `<Region>_<location_id>`, for example: `Europe_1`

In [None]:
outlets_detail['new_id'] = outlets_detail['Region'] + '_' + outlets_detail['location_id'].astype(str)
outlets_detail

**Q6)** Use the `.copy()` method to create a new DataFrame called `outlets_final`, which is the same as the `outlets_detail` DataFrame.

- On the `outlets_final` DataFrame, drop the original `location_id` column using the `.drop()` method with the `axis=1` parameter, then set the `new_id` column as the index of the DataFrame, discarding the original index:

In [None]:
#add your code below
#outlets_final = outlets_detail.copy()

outlets_final = outlets_detail.copy()
outlets_final.drop('location_id', axis=1, inplace=True)
outlets_final.set_index('new_id', inplace=True)
outlets_final


*Note how the `.drop()` method is destructive, in that running the code a second time will throw an error because the given column cannot be found. In these circumstances you may need to re-run previous code to get the DataFrame back to its previous state.*
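
The behaviour described in the note can be reproduced on a small made-up DataFrame (`df` is a hypothetical name). The sketch also shows the `errors='ignore'` parameter of `.drop()`, which is one possible way to make a drop safe to run more than once:

```python
import pandas as pd

# Hypothetical toy table
df = pd.DataFrame({'location_id': [1, 2], 'new_id': ['Europe_1', 'Europe_2']})

# First in-place drop removes the column
df.drop('location_id', axis=1, inplace=True)

# A second identical drop raises KeyError because the column is already gone
try:
    df.drop('location_id', axis=1, inplace=True)
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# errors='ignore' skips labels that are not present, so the call can be repeated
df = df.drop(columns='location_id', errors='ignore')
print(df.columns.tolist())  # ['new_id']
```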

## Part 3: Data grouping and aggregation

- Make sure you run the following code cell before you attempt any of the questions
- In the following section, you will be analysing some data related to ward profiles for each ward in Greater London. This will give an overview of the population in these small areas by presenting a range of data on the population, diversity, households, life expectancy, housing, crime, benefits, land use, deprivation, and employment

*To find out more about the dataset, see: [link](https://www.data.gov.uk/dataset/c7869dd4-7a05-4d5d-9e42-bdfc8da8c1b7/ward-profiles-and-atlas)*

In [None]:
import pandas as pd

In [None]:
wards = pd.read_csv('data/ward-profiles-clean.csv')
wards.head()

**Q7)** Use the `.groupby()` method to create a Series called `population` which contains the `sum` of the values in the `Population - 2015` column for each `Borough` in the `wards` DataFrame:

In [None]:
#add your code below
#population =

population = wards.groupby('Borough')['Population - 2015'].sum()
population


**Q8)** Use the `.groupby()` and `.agg()` methods to create a DataFrame called `cars_stats` which contains the `max`, `min` and `mean` of the `Cars per household - 2011` column for each `Borough` in the `wards` DataFrame:

In [None]:
#add your code below
#cars_stats =

cars_stats = wards.groupby('Borough')['Cars per household - 2011'].agg(['max', 'min', 'mean'])
cars_stats
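
Passing a list of function names to `.agg()` produces columns called `max`, `min` and `mean`. If more descriptive column names are wanted, pandas also supports named aggregation. The sketch below uses invented values, and the output names (`max_cars`, `min_cars`, `mean_cars`) are hypothetical choices rather than part of the exercise:

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the workshop's, the values are invented
toy = pd.DataFrame({
    'Borough': ['A', 'A', 'B'],
    'Cars per household - 2011': [0.5, 1.5, 2.0],
})

# Named aggregation: each keyword becomes an output column, given as (column, function)
stats = toy.groupby('Borough').agg(
    max_cars=('Cars per household - 2011', 'max'),
    min_cars=('Cars per household - 2011', 'min'),
    mean_cars=('Cars per household - 2011', 'mean'),
)
print(stats)
```

As with the list form, the grouping column `Borough` becomes the index of the result.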


You have been given the below code that will update `cars_stats` so that `mean` is rounded to one decimal place:

In [None]:
cars_stats['mean'] = cars_stats['mean'].round(1)
cars_stats.head(3)

You have been given the below code, which creates a DataFrame called `transport` containing the columns `['Borough', 'Ward', 'Average Public Transport Accessibility score - 2014', '% travel by bicycle to work - 2011']` from `wards`:

In [None]:
transport = wards[['Borough', 
                   'Ward', 
                   'Average Public Transport Accessibility score - 2014', 
                   '% travel by bicycle to work - 2011']]

transport.head()

**Q9)** Merge the columns from `cars_stats` into `transport`, such that:

- the number of rows in `transport` remains the same
- three new columns are added (`max`, `min`, `mean`)
- the values in each of these columns for all wards in a given `Borough` are the same

In [None]:
#add your code below
#transport =

transport = transport.merge(cars_stats, how='left', on='Borough')
transport
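
After a `.groupby()`, the grouping column is the index of the result rather than an ordinary column. The `on` parameter of `.merge()` accepts index level names as well as column names, which is why merging on `'Borough'` works here. The sketch below shows this on made-up data (`rows` and `per_borough` are hypothetical names) and adds the optional `validate='many_to_one'` check, which raises an error if a key is duplicated in the right-hand table:

```python
import pandas as pd

# Hypothetical toy data
rows = pd.DataFrame({'Borough': ['A', 'A', 'B'], 'Ward': ['a1', 'a2', 'b1'],
                     'cars': [0.5, 1.5, 2.0]})

# Grouped result: 'Borough' becomes the index, not a column
per_borough = rows.groupby('Borough')['cars'].agg(['max', 'min', 'mean'])

# `on` may name an index level of the right-hand table; validate='many_to_one'
# raises MergeError if a borough appears more than once in `per_borough`
combined = rows.merge(per_borough, how='left', on='Borough', validate='many_to_one')

# Row count is unchanged and every ward in a borough shares the same statistics
print(len(combined))
print(combined['mean'].tolist())
```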


**Q10)** Sort the rows of the `transport` DataFrame using the `.sort_values()` method with the `ascending=False` parameter, so that the wards in the `Borough` with the highest average cars per household end up at the top:

In [None]:
#add your code below

transport.sort_values('mean', ascending=False, inplace=True)
transport
