# Python Data Wrangling with `pandas` Solutions

In [None]:
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [None]:
# Adjust some settings in matplotlib
mpl.rc('savefig', dpi=200)
plt.style.use('ggplot')
plt.rcParams['xtick.minor.size'] = 0
plt.rcParams['ytick.minor.size'] = 0

In [None]:
unemployment = pd.read_csv('../data/country_total.csv')

---

### Challenge 1: Import Data From A URL

Above, we imported the unemployment data using the `read_csv` function and a relative file path. `read_csv` is [a very flexible function](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html); it also allows us to import data using a URL as the file path.

A csv file with data on world countries and their abbreviations is located at the URL:

[https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv](https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv)

We've saved this exact URL as a string variable, `countries_url`, below.

Using `read_csv`, import the country data and save it to the variable `countries`.

---

In [None]:
countries_url = 'https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv'
countries = pd.read_csv(countries_url)

### Challenge 2: The `tail` method

DataFrames all have a method called `tail` that takes an integer as an argument and returns a new DataFrame. Before using `tail`, can you guess at what it does? Try using `tail`; was your guess correct?

In [None]:
countries.tail(10)

---

### Challenge 3: Describe `countries`

It's important to understand a few fundamentals about your data before you start work with it, including what information it contains, how large it is, and how the values are generally distributed.

Using the methods and attributes above, answer the following questions about `countries`:

* what columns does it contain?
* what does each row stand for?
* how many rows and columns does it contain?
* are there any missing values in the latitude or longitude columns? 

Hint: the `head` and `describe` methods, as well as the `shape` attribute, will be helpful here.

---

**Question**: What columns does it contain?

**Answer**: Use the `.columns` data attribute of the `countries` `DataFrame`.

In [None]:
print(countries.columns)

**Question**: What does each row stand for?

**Answer**: Each row stands for a single country.

In [None]:
countries.head()

**Question**: How many rows and columns does it contain?

**Answer**: Use the `.shape` attribute, which returns a `tuple` of the form (number of rows, number of columns).

In [None]:
print(countries.shape)

**Question**: Are there any missing values in the latitude or longitude columns?

**Answer**: The `describe` method reports the number of non-missing values in each column as `count`. The count for both latitude and longitude is `30.0`, which matches the number of rows, so there is no missing data.

In [None]:
countries.describe()
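
A more direct check is to count missing values per column. The following is a minimal sketch on a hypothetical toy table (`toy` and its values are made up for illustration and are not the real `countries` data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for `countries`
toy = pd.DataFrame({
    'country': ['AT', 'BE', 'BG'],
    'latitude': [47.7, 50.5, np.nan],
    'longitude': [14.6, 4.5, 25.2],
})

# isna() flags missing entries; sum() counts the True values in each column
missing_per_column = toy[['latitude', 'longitude']].isna().sum()
print(missing_per_column)
```

A column with no missing data shows a count of zero here, which agrees with what the `count` row of `describe` implies.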

---

### Challenge 4: Renaming a Column

The "other_feature" column in our `bacteria` table isn't very descriptive. Suppose we know that "other_feature" refers to a second set of bacteria count observations. Use the `rename` method to give "other_feature" a more descriptive name.

---

In [None]:
# Create a bacteria dataframe
bacteria = pd.DataFrame(
    {'bacteria_counts' : [632, 1638, 569, 115],
    'other_feature' : [438, 833, 234, 298]},
    index=['Firmicutes',
           'Proteobacteria',
           'Actinobacteria',
           'Bacteroidetes'])

In [None]:
# Rename "other_feature" in bacteria
bacteria.rename(columns={'other_feature': 'second_count'}, inplace=True)
bacteria
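
As a side note, `rename` can also be used without `inplace=True`, in which case it returns a new DataFrame and leaves the original untouched. A small sketch on a hypothetical two-row copy of the table (`toy` and `renamed` are illustrative names):

```python
import pandas as pd

# Hypothetical two-row version of the bacteria table
toy = pd.DataFrame(
    {'bacteria_counts': [632, 1638], 'other_feature': [438, 833]},
    index=['Firmicutes', 'Proteobacteria'])

# Without inplace=True, rename returns a renamed copy
renamed = toy.rename(columns={'other_feature': 'second_count'})

print(list(toy.columns))      # original column names are unchanged
print(list(renamed.columns))  # the copy carries the new name
```

Assigning the result back to a variable is often easier to follow in longer notebooks than modifying a DataFrame in place.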

---

### Challenge 5: Indexing to Obtain a Specific Value

Both `loc` and `iloc` can be used to select a particular value if they are given two arguments. The first argument is the name (when using `loc`) or index number (when using `iloc`) of the *row* you want, while the second argument is the name or index number of the *column* you want.

Using `loc`, select "Bacteroidetes" and "bacteria_counts" to get the count of Bacteroidetes.

BONUS: how could you do the same task using `iloc`?

---

In [None]:
bacteria.loc['Bacteroidetes', 'bacteria_counts']

In [None]:
bacteria.iloc[3, 0]

In [None]:
bacteria[3:4]['bacteria_counts']
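
The three cells above do not return quite the same thing: `loc` and `iloc` with two scalar arguments give a single value, while slicing the rows first gives a one-element `Series`. A minimal sketch on a hypothetical one-column copy of the table (`toy` is an illustrative name):

```python
import pandas as pd

# Hypothetical one-column version of the bacteria table
toy = pd.DataFrame(
    {'bacteria_counts': [632, 1638, 569, 115]},
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

by_label = toy.loc['Bacteroidetes', 'bacteria_counts']  # single value, by name
by_position = toy.iloc[3, 0]                            # single value, by position
by_slice = toy[3:4]['bacteria_counts']                  # Series with one element

print(by_label, by_position)
print(type(by_slice))
```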

---

### Challenge 6: Indexing Multiple Rows and Columns

Both `loc` and `iloc` can be used to select subsets of columns *and* rows at the same time if they are given lists (and/or slices) as their two arguments.

Using `iloc` on the `unemployment` DataFrame, get:
* every row starting at row 4 and ending at row 7
* the 0th, 2nd, and 3rd columns

BONUS: how could you do the same task using `loc`?

---

In [None]:
unemployment.rename(columns={'month' : 'year_month'}, inplace=True)

In [None]:
unemployment.iloc[3:7, [0, 2, 3]]

In [None]:
unemployment.loc[3:7, ['country', 'year_month', 'unemployment']]

Uh-oh, those are different! Why? Because slicing with `.loc` treats the end position inclusively, while slicing with `.iloc` (and on the DataFrame itself!) treats the end position exclusively (as Python lists and `numpy` arrays do).

So, we need to do this:

In [None]:
unemployment.loc[3:6, ['country', 'year_month', 'unemployment']]
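
The difference is easy to see on a small table. Here is a sketch using a hypothetical eight-row DataFrame with the default integer index (`toy` is an illustrative name, not part of the workshop data):

```python
import pandas as pd

# Hypothetical DataFrame with the default integer index 0..7
toy = pd.DataFrame({'value': list('abcdefgh')})

by_position = toy.iloc[3:7]  # positions 3, 4, 5, 6 (end excluded)
by_label = toy.loc[3:7]      # labels 3, 4, 5, 6, 7 (end included)

print(len(by_position), len(by_label))

# Ending the label slice one step earlier makes the two selections agree
print(toy.loc[3:6].equals(toy.iloc[3:7]))
```

The two only line up this neatly because the index labels happen to equal the row positions; after sorting or filtering, labels and positions can diverge.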

---

### Challenge 7: Another Way to Obtain the Year

If you didn't know that casting floats to ints truncates the decimals in Python, you could have used NumPy's `floor()` function. `np.floor` takes an array or `pd.Series` of floats as its argument, and returns an array or `pd.Series` where every float has been rounded down to the nearest whole number. 

Use `np.floor` to round the values in the `year_month` column down so we can cast them as integer years. Note that the types are still floats, so we'll still need to use `astype` to typecast.

---

In [None]:
unemployment['year'] = unemployment['year_month'].astype(int)
unemployment['month'] = ((unemployment['year_month'] - unemployment['year']) * 100).round(0).astype(int)

In [None]:
unemployment = unemployment[['country',
                             'seasonality',
                             'year_month',
                             'year',
                             'month',
                             'unemployment',
                             'unemployment_rate']]

In [None]:
# Select the "year_month" column (the next two lines are equivalent)
year_month = unemployment.loc[:, 'year_month']
year_month = unemployment['year_month']

# Use np.floor on year_month to get the years as floats
years_by_floor = np.floor(year_month)

# Cast years_by_floor to integers using astype(int)
int_years = years_by_floor.astype(int)

# Check that this gets the same answers as our first approach
# This should return True
(unemployment['year_month'].astype(int) == int_years).all()

The last line of code in the previous cell does an element-wise comparison of the values in the corresponding arrays. The `.all()` method checks whether *all* elements are `True`.
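
The two approaches agree here because every value in `year_month` is positive. For negative numbers they differ: casting to `int` truncates toward zero, while `np.floor` rounds down. A small sketch with made-up values (`values` is a hypothetical Series) that also shows the element-wise comparison and `.all()`:

```python
import numpy as np
import pandas as pd

# Made-up values; the negative entry is there only to show the difference
values = pd.Series([1993.01, 2005.12, -1.5])

truncated = values.astype(int)          # drops the fractional part: -1.5 becomes -1
floored = np.floor(values).astype(int)  # rounds down first: -1.5 becomes -2

matches = truncated == floored  # element-wise comparison, gives a Boolean Series
print(matches.tolist())
print(matches.all())            # True only if every element matched
```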

---

### Challenge 8: Merging on Columns with Different Names

You may sometimes need to merge on columns with different names. To do so, use the `left_on` and `right_on` parameters, where the first listed `DataFrame` is the "left" one and the second is the "right." It might look something like this:

```
pd.merge(one, two, left_on='city', right_on='city_name')
```

Suppose we wanted to merge `unemployment` with a new DataFrame called `country_codes`, where the abbreviation for each country is in the column "c_code":

---

In [None]:
countries_url = 'https://raw.githubusercontent.com/dlab-berkeley/Python-Data-Wrangling/main/data/countries.csv'
countries = pd.read_csv(countries_url)
country_names = countries[['country', 'country_group', 'name_en']]

In [None]:
unemployment = pd.merge(unemployment, country_names, on='country')

In [None]:
country_codes = country_names.rename({"country": "c_code"}, axis=1).drop("country_group", axis=1)
country_codes.head()

Use `merge` to merge `unemployment` and `country_codes` on their country codes. Make sure to specify `left_on=` and `right_on=` in the call to `merge`!

In [None]:
unemployment_merged = pd.merge(unemployment, country_codes, left_on='country', right_on='c_code')
unemployment_merged.head()
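
To see what `left_on` and `right_on` do to the result, here is a minimal sketch on two hypothetical toy tables (`left`, `right`, and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy tables; 'xx' has no match in the right table
left = pd.DataFrame({'country': ['at', 'be', 'xx'],
                     'unemployment_rate': [4.5, 8.2, 6.0]})
right = pd.DataFrame({'c_code': ['at', 'be'],
                      'name_en': ['Austria', 'Belgium']})

# Match rows where left['country'] equals right['c_code']
merged = pd.merge(left, right, left_on='country', right_on='c_code')

print(list(merged.columns))
print(len(merged))
```

Both key columns are kept in the output because they have different names, and the default inner join drops the row whose code has no match.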

---

### Challenge 9: Exploring Unemployment Rates

What are the minimum and maximum unemployment rates in our data set? Which unemployment rates are most and least common?

Hint: for the first question, look at how we found the minimum and maximum years; for the second, use `value_counts`.

---

In [None]:
unemployment['unemployment_rate'].min(), unemployment['unemployment_rate'].max()

In [None]:
unemployment['unemployment_rate'].value_counts()

In [None]:
unemployment['unemployment_rate'].describe()
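
Since `value_counts` returns a Series of counts indexed by value, the most and least common values can be pulled out directly. A small sketch with made-up rates (`rates` is a hypothetical Series, not the workshop data):

```python
import pandas as pd

# Made-up unemployment rates for illustration
rates = pd.Series([5.0, 7.5, 5.0, 9.1, 5.0, 7.5])

counts = rates.value_counts()    # sorted by count, most frequent first
most_common = counts.idxmax()    # value with the highest count
least_common = counts.idxmin()   # value with the lowest count

print(counts)
print(most_common, least_common)
```

When several values share the same count, `idxmax` and `idxmin` report only the first one they encounter, so it is worth looking at the full `counts` output as well.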

---

### Challenge 10: Group By Practice

Find the average unemployment rate for European Union vs. non-European Union countries. 

1. First, use `groupby()` to group on "country_group".
2. Then, select the "unemployment_rate" column,
3. Aggregate by using `.mean()` to get the average.

---

In [None]:
unemployment.groupby('country_group')['unemployment_rate'].mean()
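
The three steps can be checked by hand on a tiny table. This sketch uses hypothetical group labels and rates (`toy`, `'eu'`, and `'non-eu'` are assumptions for illustration; the labels in the real data may differ):

```python
import pandas as pd

# Hypothetical toy table with two groups of two rows each
toy = pd.DataFrame({
    'country_group': ['eu', 'eu', 'non-eu', 'non-eu'],
    'unemployment_rate': [4.0, 6.0, 3.0, 9.0],
})

# Group, select the column, then aggregate
group_means = toy.groupby('country_group')['unemployment_rate'].mean()
print(group_means)
```

The result is a Series indexed by group label, so an individual mean can be read with `group_means['eu']`.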

---

### Challenge 11: Boolean Indexing

Suppose we only want to look at unemployment data from the year 2000 or later. Use Boolean indexing to create a DataFrame with only these years.

1. Select the "year" column from `unemployment`.
2. Using the year data, create a **mask**: an array of Booleans where each value is True if and only if the year is 2000 or later. Remember, you can use Boolean operators like `>`, `<`, and `==` on a column.
3. Use the mask from step 2 to index `unemployment`.

---

In [None]:
# Select the year column from unemployment
year = unemployment['year']

# Create a mask
later_or_equal_2000 = year >= 2000

# Boolean index unemployment
unemployment_2000later = unemployment[later_or_equal_2000]
unemployment_2000later.head()
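
Masks can also be combined. The sketch below repeats the three steps on a hypothetical four-row table (`toy` and its values are made up) and then adds a second condition with `&`; each condition needs its own parentheses:

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({'year': [1998, 2000, 2005, 1999],
                    'unemployment_rate': [7.1, 6.4, 8.0, 7.3]})

mask = toy['year'] >= 2000  # Boolean Series, one value per row
subset = toy[mask]          # keeps only rows where the mask is True

# Two conditions at once: year 2000 or later AND rate above 7
both = toy[(toy['year'] >= 2000) & (toy['unemployment_rate'] > 7)]

print(mask.tolist())
print(subset['year'].tolist())
print(both['year'].tolist())
```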

---

### Challenge 12: Plot without Missing Values

Note that there are some dates for which we lack data on Spain's unemployment rate. What could you do if you wanted your plot to show only dates where both Spain and Portugal have a defined unemployment rate?

---

In [None]:
unemployment.dropna(subset=['unemployment_rate'], inplace=True)

In [None]:
unemployment.sort_values(['name_en', 'year_month'], inplace=True)
unemployment.reset_index(drop=True, inplace=True)

In [None]:
ps = unemployment[(unemployment['name_en'].isin(['Portugal', 'Spain'])) &
                  (unemployment['seasonality'] == 'sa')]
datetimes = pd.to_datetime(ps['year'].astype(str) + '/' + ps['month'].astype(str) + '/1')
ps.insert(loc=0, column='date', value=datetimes)
ps = ps[['date', 'name_en', 'unemployment_rate']]
ps.columns = ['Time Period', 'Country', 'Unemployment Rate']
ps = ps.pivot(index='Time Period', columns='Country', values='Unemployment Rate')

In [None]:
ps_nomissing = ps.dropna()
ps_nomissing.shape

In [None]:
ps.shape

In [None]:
ps_nomissing.plot(figsize=(10, 8), title='Unemployment Rate\n')
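
Because the pivoted table has one column per country, `dropna` with its default settings removes any date where at least one country is missing. A minimal sketch on a hypothetical three-row table (`wide` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted table: one row per date, one column per country
wide = pd.DataFrame(
    {'Portugal': [5.0, 5.2, 5.1], 'Spain': [np.nan, 14.0, 13.8]},
    index=pd.to_datetime(['1983-01-01', '1983-02-01', '1983-03-01']))

# Default behavior drops a row if any of its values is missing
complete = wide.dropna()

print(wide.shape, complete.shape)
```

Comparing the two shapes, as the cells above do for `ps` and `ps_nomissing`, shows how many dates were dropped.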