# `pandas`

This workshop's goal&mdash;which is facilitated by this Jupyter notebook&mdash;is to give attendees the confidence to use `pandas` in their research projects. Basic familiarity with Python *is* assumed.

`pandas` is designed to make it easier to work with structured data. Most of the analyses you might perform will likely involve using tabular data, e.g., from .csv files or relational databases (e.g., SQL). The `DataFrame` object in `pandas` is "a two-dimensional tabular, column-oriented data structure with both row and column labels."

If you're curious:

>The `pandas` name itself is derived from *panel data*, an econometrics term for multidimensional structured data sets, and *Python data analysis* itself. After getting introduced, you can consult the full [`pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/).

To motivate this workshop, we'll work with example data and go through the various steps you might need to prepare data for analysis. You'll (hopefully) realize that doing this type of work is much more difficult using Python's built-in data structures.

### Table of Contents

1 - [The DataFrame](#section1)<br>

2 - [Rename, Index, and Slice](#section2)<br>

3 - [Data Analysis](#section3)<br>

4 - [Data Manipulation](#section4)<br>

5 - [Groupby](#section5)<br>

6 - [Concatenation & Joins](#section6)<br>

7 - [Plotting](#section7)<br>

## 1. The DataFrame <a id="section1"/>
The data used in these examples is available in the following [GitHub repository](https://github.com/dlab-berkeley/introduction-to-pandas). If you've [cloned that repo](https://www.atlassian.com/git/tutorials/setting-up-a-repository/git-clone), which is the recommended approach, you'll have everything you need to run this notebook. Otherwise, you can download the data file(s) from the above link. (Note: this notebook assumes that the data files are in a directory named `data/` found within your current working directory.)

We plan on working with a variety of datasets ranging from unemployment statistics to happiness measures to pokemon attributes and more.

Let's begin by importing `pandas` using the conventional abbreviation.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

The `read_csv()` function in `pandas` allows us to easily import our data. By default, it assumes the data is comma-delimited. However, you can specify the delimiter used in your data (e.g., tab, semicolon, pipe, etc.). There are several parameters that you can specify. See the documentation [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). `read_csv()` returns a `DataFrame`.
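
To see the delimiter option in action, here is a minimal sketch that reads a small semicolon-delimited table. The data and the names `raw` and `demo` are made up for illustration, and `io.StringIO` simply lets us treat a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data, held in memory for illustration
raw = "country;rate\nfr;9.1\nes;19.9\n"

# `sep` tells read_csv which delimiter to split on
demo = pd.read_csv(io.StringIO(raw), sep=";")
print(demo)
```

Without `sep=";"`, each line would be read as a single column, because no commas appear in the data.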

Notice that we call `read_csv()` using the `pd` abbreviation from the import statement above.

In [None]:
unemployment = pd.read_csv('data/country_total.csv')

Great! You've created a `pandas` `DataFrame`. We can look at our data by using the `.head()` method. By default, this shows the header (column names) and the first five rows. Passing an integer, $n$, to `.head()` returns that number of rows. 

In [None]:
unemployment.head()

DataFrames all have a method called `tail` that takes an integer as an argument and returns a new DataFrame. Before using `tail`, can you guess at what it does? Try using `tail`; was your guess correct?

In [None]:
unemployment.tail()

To find the number of rows, you can use the `shape` attribute.

In [None]:
unemployment.shape

There are 20,796 rows and 5 columns.

The `.info()` method is an incredibly useful diagnostic tool for when you're getting to know a new dataset.

In [None]:
unemployment.info()

`.info()` tells us:
- Number of rows and columns
- The data type of each column and the tally of each datatype.
- The number of non-null values. If those numbers are less than the number of total rows then that column has null values.
- The approximate memory usage of the dataframe.

The `.columns` and `.dtypes` attributes return the column names and data types.

In [None]:
#Column names
unemployment.columns

In [None]:
#Data types
unemployment.dtypes

`read_csv` is [a very flexible function](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html); it also allows us to import data using a URL as the file path.

A csv file with data on world countries and their abbreviations is located at [https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv](https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv) (saved as a string variable `countries_url` below).

In [None]:
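#Save the URL as a string variable, then pass it to read_csv like a file path
countries_url = 'https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv'
countries = pd.read_csv(countries_url)
countries.info()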

## 2. Rename, Indexing, Dropping, and Slicing <a id="section2"/>
Back to the entire unemployment data set. You may have noticed that the `month` column also includes the year. Let's go ahead and rename it.

In [None]:
unemployment.head()

In [None]:
unemployment.rename(columns={'month' : 'year_month'}, inplace=True)

The `.rename()` method allows you to modify index labels and/or column names. As you can see, we passed a `dict` to the `columns` parameter, with the original name as the key and the new name as the value. Importantly, we also set the `inplace` parameter to `True`, which modifies the *actual* `DataFrame`, not a copy of it.
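
As a quick sketch of what `inplace` changes, the toy example below (hypothetical data, not the unemployment file) calls `.rename()` without `inplace=True` and compares the original with the returned copy.

```python
import pandas as pd

toy = pd.DataFrame({"month": [1993.01, 1993.02]})  # hypothetical data

# Without inplace=True, rename returns a new DataFrame and leaves `toy` alone
renamed = toy.rename(columns={"month": "year_month"})

print(list(toy.columns))      # original column name is unchanged
print(list(renamed.columns))  # the copy carries the new name
```

Reassigning the result (`toy = toy.rename(...)`) is an alternative to `inplace=True` with the same end effect.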

To select a single column we can either use bracket (`[]`) or dot notation (referred to as *attribute access*).

In [None]:
unemployment['year_month'].head()

In [None]:
unemployment.year_month.head()

It is preferable to use the bracket notation, as a column name might inadvertently have the same name as a `DataFrame` (or `Series`) method. In addition, only bracket notation can be used to create a new column. If you try to use attribute access to create a new column, you'll create a new attribute, *not* a new column.
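
Here is a small sketch of both pitfalls on a made-up `DataFrame`. The column name `count` is chosen on purpose because it collides with the `DataFrame.count()` method; the name `attr_made_column` is just a helper for this illustration.

```python
import warnings
import pandas as pd

toy = pd.DataFrame({"count": [3, 5], "rate": [0.1, 0.2]})  # hypothetical data

# "count" is also a DataFrame method, so dot notation returns the method
print(callable(toy.count))    # attribute access finds the method, not the column
print(toy["count"].tolist())  # bracket notation always finds the column

# Attribute assignment sets a plain Python attribute (pandas warns about this)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.doubled = toy["rate"] * 2
attr_made_column = "doubled" in toy.columns
print(attr_made_column)

# Bracket assignment creates a real column
toy["doubled"] = toy["rate"] * 2
print("doubled" in toy.columns)
```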

When selecting a single column, we have a `pandas` `Series` object, which is a single vector of data (e.g., a NumPy array) with "an associated array of data labels, called its *index*." A `DataFrame` also has an index. In our example, the indices are an array of sequential integers, which is the default. You can find them in the left-most position, without a column label.

Indices need not be a sequence of integers. They can, for example, be dates or strings. Note that indices do *not* need to be unique.
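
A minimal sketch with a made-up `Series` shows what a non-unique string index looks like in practice.

```python
import pandas as pd

# Hypothetical Series with string labels, one of which repeats
s = pd.Series([10, 20, 30], index=["a", "a", "b"])

print(s.index.is_unique)  # False, since "a" appears twice
print(s.loc["a"])         # a repeated label returns every matching row
print(s.loc["b"])         # a unique label returns a single value
```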

We can select multiple columns by indexing the dataframe with a list of column names.

In [None]:
unemployment[["year_month", "unemployment"]].head()

Deleting columns is done with the `.drop()` method. **This method is used for both the index and the columns**, so we must specify `axis = 1` to tell pandas to drop a column.

In [None]:
unemployment.drop("unemployment", axis = 1).head()

In [None]:
#Multiple columns
unemployment.drop(["unemployment", "seasonality"], axis = 1).head()

This change isn't permanent because `inplace` defaults to `False`.

In [None]:
#Permanently drop a column
#unemployment.drop("unemployment", axis = 1, inplace=True)

Let's look at a few more useful ways to index data, that is, to select rows.

`.loc` primarily works with string labels. It accepts a single label, a list (or array) of labels, or a slice of labels (e.g., `'a' : 'f'`).

Let's create a `DataFrame` to see how this works. (This is based on an [example](https://github.com/fonnesbeck/scipy2015_tutorial/blob/master/notebooks/1.%20Data%20Preparation.ipynb) from Chris Fonnesbeck's [Computational Statistics II Tutorial](https://github.com/fonnesbeck/scipy2015_tutorial).)

In [None]:
bacteria = pd.DataFrame({'bacteria_counts' : [632, 1638, 569, 115],
                         'other_feature' : [438, 833, 234, 298]},
                         index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

Notice that we pass in a `dict`, where the keys correspond to column names and the values to the data. In this example, we've also set the indices&mdash;strings in this case&mdash;to be the taxon of each bacterium.

In [None]:
bacteria

Now, if we're interested in the values (row) associated with "Actinobacteria," we can use `.loc` and the index name.

In [None]:
bacteria.loc['Actinobacteria']

This returns the column values for the specified row. Interestingly, we could have also used "positional indexing," even though the indices are strings.

In [None]:
bacteria[2:3]

The difference is that the former returns a `Series` because we selected a single label, while the latter returns a `DataFrame` because we selected a range of positions.

Let's return to our unemployment data. Another indexing option, `.iloc`, primarily works with integer positions. To select specific rows, we can do the following.

In [None]:
unemployment.iloc[[1, 5, 6, 9]]

We can select a range of rows and specify the step value.

In [None]:
unemployment.iloc[25:50:5]

(Note: As is typical in Python, the end position is not included. Therefore, we don't see the row associated with the index 50.)

Indexing is important. You'll use it a lot. Below, we'll show how to index based on data values.



The "other_feature" column in our `bacteria` table isn't very descriptive. Suppose we know that "other_feature" refers to a second set of bacteria count observations. Use the `rename` method to give "other_feature" a more descriptive name.

In [None]:
# rename "other_feature" in bacteria
bacteria.rename(columns={'other_feature':'second_count'}, inplace=True)
bacteria

### Challenge 1A: Indexing to get a specific value

Both `loc` and `iloc` can be used to select a particular value if they are given two arguments. The first argument is the name (when using `loc`) or index number (when using `iloc`) of the *row* you want, while the second argument is the name or index number of the *column* you want.

Using `loc`, select "Bacteroidetes" and "bacteria_counts" to get the count of Bacteroidetes.

How could you do the same task using `iloc`?

### Challenge 1B: Indexing multiple rows and columns

Both `loc` and `iloc` can be used to select subsets of columns *and* rows at the same time if they are given lists (and/or slices) as their two arguments.

Using `iloc` on the `unemployment` DataFrame, get:
* every row starting at row 4 and ending at row 7
* the 0th, 2nd, and 3rd columns

Repeat the same task, but with `loc`.

Uh-oh, those are different! Why? Because slicing with `.loc` treats the end of the slice inclusively, while slicing with `.iloc` (and on the dataframe itself!) treats the end of the slice exclusively (as Python lists and `numpy` arrays do).

So, we need to do this:
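
As a minimal sketch on a hypothetical toy `DataFrame` (default integer index, so labels and positions coincide, and only three made-up columns), the two slices line up once the `.iloc` end position is pushed one past the last row we want:

```python
import pandas as pd

toy = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})

by_position = toy.iloc[4:8, [0, 2]]  # end position 8 is excluded
by_label = toy.loc[4:7, ["a", "c"]]  # end label 7 is included

print(by_position.equals(by_label))
```

The same adjustment applies to the `unemployment` challenge above, with its own column names and positions.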

### Boolean Indexing or Conditional Filtering 
Suppose we wanted to construct a dataframe for a specific country and above a certain unemployment rate threshold.

**Task**: Return a dataframe where the unemployment rate is greater than 9.0 for the country of France.

- Step 1: Grab rows belonging to France
- Step 2: Grab rows where unemployment rate is greater than 9.0%
- Step 3: Use the two conditions to filter or index the `unemployment` dataframe.

In [None]:
#select unemployment rate and country columns
unemployment_rate = unemployment.unemployment_rate
country = unemployment.country

In [None]:
#create a boolean mask for un rate
unemployment_rate>9.0

The mask is a `Series` of boolean values with the same length as the original dataframe.

In [None]:
#create boolean mask for country
country == 'fr'

First, let's filter the `unemployment` dataframe using our `unemployment_rate` threshold of 9.0.

In [None]:
#We pass in the boolean mask like we're slicing the dataframe
unemployment[unemployment_rate>9]

This returns a dataframe containing only the rows where the unemployment rate is greater than 9.0.

In [None]:
#Country version using france
unemployment[country=='fr']

Now let's combine the two!

In [None]:
#Wrap both boolean masks in parentheses and join them with & so that a row must satisfy both conditions
unemployment[(unemployment.unemployment_rate>9.0) & (unemployment.country == 'fr')]
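
Masks can be combined in other ways too. The sketch below uses a made-up four-row table (not the real unemployment file) to show `|` for "or", `~` for "not", and `.isin()` for membership in a list of values.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["fr", "es", "pt", "fr"],
                    "unemployment_rate": [8.5, 19.9, 12.0, 9.4]})

# | means "or": rows for France, plus any row with a rate above 15
either = toy[(toy["country"] == "fr") | (toy["unemployment_rate"] > 15)]

# ~ negates a mask: every row that is not France
not_france = toy[~(toy["country"] == "fr")]

# .isin() tests membership in a list of values
iberia = toy[toy["country"].isin(["es", "pt"])]

print(either)
print(not_france)
print(iberia)
```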

### Challenge 2: Slicing Census Data

Using the census dataset loaded below, which features a collection of US counties and their socio-economic attributes, complete the following tasks.

- Create a subset dataframe using `loc` containing the following columns: State, County, WorkAtHome, MeanCommute
- Create a dataframe of counties exclusively from the following states: Kansas, Maryland, Oregon
- How many counties in California have a total population greater than 250000?

In [None]:
census = pd.read_csv("data/census_data.csv")
census.head()

In [1]:
#Task1

In [2]:
#Task2

In [3]:
#Task3

## 3. Data Analysis <a id="section3"/>

Pandas is great for conducting exploratory data analysis. Whether we need to find the mean of a column or count the proportions of a categorical variable's items, pandas is your go-to tool.

Let's introduce a new dataset: movies

In [None]:
path = "data/movies.csv"

In [None]:
movies = pd.read_csv(path)
movies.head()

In [None]:
movies.info()

Before we move ahead, let's fix the column names.

In [None]:
#Use the .str accessor to lowercase the column names and replace spaces with underscores
movies.columns = movies.columns.str.lower().str.replace(" ", "_")
movies.head()

To generate a set of summary stats call the `.describe()` method

In [None]:
movies.describe()

You may have noticed that the "count" is lower for certain columns. This is because the summary statistics are based on *non-missing* values and count reflects that.

What `describe` returns depends on what it's called on. If the `DataFrame` includes both numeric and object (e.g., strings) `dtype`s, it will default to summarizing the numeric data. If `describe` is called on strings, for example, it will return the count, number of unique values, and the most frequent value along with its count.
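
Here's a quick sketch of the string case, using a made-up `Series` of genre labels rather than the real `movies` data.

```python
import pandas as pd

genres = pd.Series(["Action", "Drama", "Action", "Comedy"])  # hypothetical values
summary = genres.describe()
print(summary)
```

The result is itself a `Series`, so individual entries such as `summary["top"]` can be pulled out by label.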

In [None]:
#describe works on series too
movies.rating.describe()

In [None]:
#mean, median
movies.rating.mean(), movies.rating.median()

What if you're interested in knowing what are the best or worst rated movies?

The `nlargest` and `nsmallest` methods can be of assistance.

In [None]:
#Show the 5 best films
movies.rating.nlargest()

The `n` parameter's default is set to 5. 

However we only see the `rating` values and not the movie titles associated with them. 


Use `nlargest` on `movies` with the `columns` parameter set to `rating` to achieve this.

In [None]:
#Dataframe version of .nlargest()
movies.nlargest(n = 5, columns="rating")[["title", "rating"]]

Conversely we can use `nsmallest` to output the worst films.

In [None]:
#Dataframe version of .nsmallest()
movies.nsmallest(n = 5, columns="rating")[["title", "rating"]]

The `movies` dataset has some interesting categorical data that we should examine as well.

Let's find out what the various genres are at our disposal.

In [None]:
g_type = movies.genre1

In [None]:
#Show the unique genres
g_type.unique()

In [None]:
#Number of uniques
g_type.nunique()

The `value_counts` method tells us the frequency of each genre.

In [None]:
g_type.value_counts()

What if we're interested in proportions? Set the `normalize` parameter of `value_counts` to `True`.

In [None]:
g_type.value_counts(normalize = True).round(2)

A common task for exploratory data analysis is looking at the correlations in your dataset. For example, we might want to see whether a number of attributes in the dataset move together.

We can use the `corr` method to return a table of all the correlations between pairs of numerical columns.

In [None]:
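#numeric_only=True skips text columns such as the title (recent pandas versions raise an error otherwise)
movies.corr(numeric_only=True)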

In [None]:
#What are all the correlations for the revenue_millions column?
movies.corr(numeric_only=True)["revenue_millions"]


In [None]:
#Whats the correlation between rating and revenue_millions?
movies.corr(numeric_only=True).loc["rating", "revenue_millions"]

### Challenge 3: Analyzing Census Data

Using the census data we imported earlier, complete the following tasks.
- How many counties does each state have?
- What are the mean and standard deviation of county population?
- What columns have the highest correlations with poverty?

In [None]:
census.columns

In [None]:
census.head()

In [None]:
#Task1


In [4]:
#Task2


In [None]:
#Task3


## 4. Manipulating Data <a id="section4"/>

The vast majority of work done with data consists of cleaning, transforming, and otherwise reshaping it to your needs. More often than not, the data you receive will have missing values (nulls), come in an unfriendly format, and have misspelled labels.

Data preparation is a necessary prerequisite to tasks such as data visualization and machine learning. Many machine learning models can't process missing or non-numerical data, so it's incumbent on you to prepare your data for the model.

Luckily for us, pandas provides relatively easy and intuitive tools which we can use to reconfigure our data.

### Sorting

We touched on the idea of ordering data earlier with `nlargest` and `nsmallest`, but sometimes we may need to turn to `sort_values` for permanently ordering data or sorting by multiple columns.

The `ascending` parameter defaults to `True`, which means it orders data from least to greatest. Set it to `False` to reverse that order.

In [None]:
#Series version
movies.production_budget.sort_values()

In [None]:
#Dataframe version
movies.sort_values(by = "production_budget").head()

How can we make this sorting permanent?

In [None]:
#Set inplace equal to True
movies.sort_values(by = "rating", inplace=True)

Remember that whenever you use `inplace=True` you won't see an output.

In [None]:
#View sorted dataframe
movies.head()

The index can also be sorted, which we can do to actually undo the previous action.

In [None]:
movies.sort_index(inplace=True)

In [None]:
#View change
movies.head()

Now let's sort by multiple columns

In [None]:
#Initialize column list
cols = ["year", "runtime_minutes"]
#Sort data from least to greatest first by year and then by runtime
movies.sort_values(by = cols, ascending=True)[cols]

The above `DataFrame` is sorted primarily by `year`; in the case of ties, `runtime_minutes` is used as the tiebreaker.

We can also feed in a list of boolean values to `ascending` if for instance we'd like to use different orders.

In [None]:
#Sort data from greatest to least by year and then least to greatest by runtime
movies.sort_values(by = cols, ascending=[False, True])[cols]

### Null Values

Pandas marks missing data or null data as "NaN" which stands for "Not a Number." To find these null values we use the `.isnull()` method. This function returns a corresponding boolean value for each value in a `Series` or `DataFrame`.

In Python `True` is equivalent to 1 and `False` is equivalent to 0. Thus we can sum up all the values in the boolean mask with `.sum()` to give us a count for the *total* number of missing values.
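
A tiny sketch with a made-up `Series` makes the arithmetic concrete: summing the mask counts the missing values, and taking its mean gives the share that is missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])  # hypothetical data with two gaps

mask = s.isnull()        # True wherever a value is missing
n_missing = mask.sum()   # each True counts as 1
share = mask.mean()      # fraction of entries that are missing

print(n_missing, share)
```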

In [None]:
movies.genre3.tail()

In [None]:
#Return isnull boolean mask
movies.genre3.isnull().tail()

In [None]:
#Return number of missing values in genre3
movies.genre3.isnull().sum()

In [None]:
#Number of missing values for every column
movies.isnull().sum()

In [None]:
#Using .mean() effectively tells us the proportion of null values in each column
movies.isnull().mean().round(2)

Since `.isnull()` outputs a boolean mask, we can use that array to conditionally filter the `DataFrame`.

In [None]:
movies[movies.genre3.isnull()].head()

The output above is a `DataFrame` of every row where `genre3` is NaN. If we want to filter out the NaNs in that column instead, we place `~` at the start of the condition.

In [None]:
movies[~movies.genre3.isnull()].head()

A more direct way to get rid of nulls is to use `.dropna()`.

In [None]:
movies.genre3.dropna()

In [None]:
#DataFrame version
movies.dropna(subset=["genre3"])

**If we wanted to permanently drop nulls what do you guess would be the way to do so?**

If you said `inplace=True` then you're right!!!

In [None]:
# movies.dropna(subset=['genre3'], inplace=True)

If you uncomment and run the line above, `movies` will no longer have any NaNs under the `genre3` column.

Sometimes we may not want to get rid of nulls but rather replace them with our own preferred value. This is referred to as imputation. It's a technique commonly used in machine learning to "save" rows with missing data by replacing NaNs with an estimated value; the column mean is a common choice.

`.fillna()` replaces nulls with an input value.

In [None]:
#Replace the missing metascore values with its mean
meta_mean = movies.metascore.mean()
movies.metascore.fillna(meta_mean)

How can we go about replacing the missing `genre2` and `genre3` values with a string that says "no_genre"?

In [None]:
repl = "no_genre"
movies.genre2.fillna(repl)

In [None]:
movies.genre3.fillna(repl)
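
Note that `.fillna()` returns a new object, so the calls above don't change `movies` unless the result is assigned back. One way to fill several columns at once is to pass a `dict` that maps column names to fill values. The sketch below uses a made-up three-row table in place of `movies`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"genre2": ["Drama", np.nan, "Comedy"],
                    "genre3": [np.nan, np.nan, "Family"],
                    "metascore": [70.0, np.nan, 55.0]})  # hypothetical data

# A dict maps each column to its own fill value; other columns are untouched
toy = toy.fillna({"genre2": "no_genre", "genre3": "no_genre"})
print(toy)
```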

### Changing Data Types

There are instances where data is not encoded in the right way, for instance, numerical data such as dollars and percents presented as strings.

In [None]:
#Create fake dataframe 
percent_sales = pd.DataFrame({"percents":["30.2", "97.5", "61.0"],
                               "revenue": ["$3438", "$2393", "$1892"]})
percent_sales

In [None]:
#View data types
percent_sales.dtypes

Changing the type of `percents` can be done with `.astype()` and passing in "float" as the desired data type

In [None]:
percent_sales.percents.astype(float)

`.astype()` does not have an `inplace` parameter, so to make the change permanent, we overwrite the column.

In [None]:
percent_sales["percents"] = percent_sales.percents.astype(float)

Now revenue's turn.

In [None]:
percent_sales.revenue.astype(float)

We get an error! 

That's because of the pesky $ signs. We can only convert string representations of numbers to floats, not non-numerical characters.

This necessitates removing the $ sign, which introduces us to `.str`, an accessor that allows us to use typical string methods such as `.lower()` and `.title()` on a `Series`.

We first call `.str` then `.replace()` which we use to get rid of the $ signs.

In [None]:
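#regex=False treats "$" as a literal character rather than a regex end-of-string anchor
percent_sales.revenue.str.replace("$", "", regex=False)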

Now we can convert "revenue" to a float.

In [None]:
percent_sales.revenue.str.replace("$", "", regex=False).astype(float)
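
If a column might contain entries that can't be parsed at all, `.astype(float)` will raise an error. One alternative, sketched here on a made-up `Series` with a deliberately bad entry, is `pd.to_numeric()` with `errors="coerce"`.

```python
import pandas as pd

revenue = pd.Series(["$3438", "$2393", "n/a"])  # hypothetical, with one bad entry

# Strip the literal "$" (regex=False so "$" is not read as a regex anchor)
cleaned = revenue.str.replace("$", "", regex=False)

# errors="coerce" turns anything unparseable into NaN instead of raising
as_numbers = pd.to_numeric(cleaned, errors="coerce")
print(as_numbers)
```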

Let's go back to the unemployment data.

In [None]:
unemployment.head()

We need to **split `year_month` into two separate columns.** Above, we saw that this column is type (technically, `dtype`) `float64`. We can extract the year using the `.astype()` method, which allows for type casting (basically converting from one type to another). To get the month, we'll then subtract the year from `year_month`, which leaves the decimal portion of the value, multiply the result by 100, and convert to `int`.

In [None]:
unemployment['year'] = unemployment['year_month'].astype(int)

In this case, we're casting the floating point values to integers. In Python, this [truncates the decimals](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex).

Finally, let's create our **month** variable as described above. (Because of the truncating that occurs when casting to `int`, we first round the values to the nearest whole number.)
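
To see why the rounding step matters, here is a small sketch on three made-up values in the same `year.month` encoding. Floating-point subtraction rarely lands exactly on a whole number, so truncating straight away can come out one lower than intended.

```python
import pandas as pd

year_month = pd.Series([1993.01, 1993.07, 1993.12])  # same encoding as the data

year = year_month.astype(int)            # truncation drops the decimal part
fraction = (year_month - year) * 100     # close to 1, 7, 12 but not exact

truncated = fraction.astype(int)         # may be off by one due to float error
rounded = fraction.round(0).astype(int)  # rounding first gives the intended month

print(fraction.tolist())
print(truncated.tolist())
print(rounded.tolist())
```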

In [None]:
unemployment['month'] = ((unemployment['year_month'] - unemployment['year']) * 100).round(0).astype(int)
unemployment.head()

### Inter Column operations

The great thing about data is that you can create more data from what's in front of you.

This dataset is missing some important information such as the profit line of each film. The good news is we can derive those numbers from what we have in front of us.

Let's derive the profit figures for our set of films. The first thing we need to do is convert the revenue column from millions of dollars to dollars.

In [None]:
#Multiply revenue_millions by a million
movies["revenue_millions"] *= 1_000_000

In [None]:
#Create a new column in movies called profit by subtracting production_budget from revenue_millions
movies["profit"] = movies["revenue_millions"] - movies["production_budget"]
movies.head()

### Challenge 4

Time for some pokemon data analysis. Using the `pokemon` dataset complete the following tasks

- What are the 10 pokemon with the highest and lowest hp values? Show just the name
- Which 3 columns have the most null values?
- Create a new column that represents `height_m` in inches. Reminder: There are 39.37 inches in a meter.
- What is the average speed for pokemon that are marked as 1 and 0 under `is_legendary`?

In [None]:
pokemon = pd.read_csv("data/pokemon.csv")
pokemon.head()

In [None]:
#Task1 


In [None]:
#Task2


In [None]:
#Task3


In [None]:
#Task4


## 5. Groupby  <a id="section5"/>


What if we'd like to apply certain operations based on a categorization of data? For instance, what if we want the average rating for each film genre?

In other words, we need to *group* the films *by* their genre designation.

We can do this with the `.groupby()` method.

In [None]:
#Calculate the average rating by genre.
movies.groupby("genre1").rating.mean()

Let's explain what just happened. We start with our `DataFrame`. We tell `pandas` that we want to group the data by genre (the `genre1` column); that's what goes in the parentheses. Next, we need to tell it what column we'd like to perform the `.mean()` operation on. In this case, it's the `rating` column.
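
To make the grouping step concrete, here's a sketch on a made-up four-row table. It computes the per-genre mean with `.groupby()` and then reproduces one group's mean by hand with a boolean mask.

```python
import pandas as pd

toy = pd.DataFrame({"genre1": ["Action", "Drama", "Action", "Drama"],
                    "rating": [6.0, 8.0, 7.0, 9.0]})  # hypothetical data

# Split by genre1, take the mean of rating within each group, combine the results
by_genre = toy.groupby("genre1")["rating"].mean()

# The same number for one group, computed by hand with a boolean mask
action_mean = toy[toy["genre1"] == "Action"]["rating"].mean()

print(by_genre)
print(action_mean)
```

`.groupby()` effectively does the masking for every distinct value at once and collects the results into a single `Series` indexed by group.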

In [None]:
#Repeat but with .describe()
movies.groupby("genre1").rating.describe()

We can also group by multiple columns.

In [None]:
#Initialize list of columns to group by with
cols = ["genre1", "year"]
movies.groupby(cols).rating.mean()

In [None]:
#Use as_index = False to return a dataframe
movies.groupby(cols, as_index=False).rating.mean()

### Challenge 5:

- Create a series that orders generations of pokemon by their average attack rating from greatest to least
- Replace the NaNs in `type2` with "no_type" and then group by `type1` and `type2` and derive the median `speed`

In [None]:
#Task1


In [None]:
#Task2


## 6. Merging and Concatenation <a id="section6"/>

How can we connect two different datasets along their columns or rows? Similar to how we concatenate strings, we can concatenate dataframes.

Load in the two different datasets from the [world happiness report](https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021?select=world-happiness-report.csv)

In [None]:
happiness2010 = pd.read_csv("data/word_happiness_report_2010.csv")
happiness2009 = pd.read_csv("data/word_happiness_report_2009.csv")
happiness2009.head()

In [None]:
#Num rows
happiness2009.shape[0], happiness2010.shape[0]

Both dataframes are structured the same and represent the same information but for two different years.

If we wanted to conduct an operation that looks at a change in a metric from one year to the next for a country, we first need to combine, or concatenate, the two dataframes.

In [None]:
#First assemble the dataframes in a list
df_list = [happiness2009, happiness2010]

`pd.concat()` is the function for this task.

In [None]:
#Pass in df_list to the concatenation function and set axis = 0
happiness = pd.concat(df_list, axis = 0)
happiness.head()

In [None]:
#Reset the index to get rid of duplicates
happiness.reset_index(drop=True, inplace=True)

In [None]:
#Num rows
happiness.shape[0]

The `axis` parameter is crucial for this function: setting it to 0 tells pandas to concatenate the dataframes vertically instead of horizontally. We do this because they have the same columns and thus should be combined along that axis.


Setting `axis` to 1 would attach the dataframes side by side.
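
The difference between the two settings is easiest to see in the resulting shapes. This sketch uses two made-up two-row tables rather than the happiness data.

```python
import pandas as pd

a = pd.DataFrame({"country": ["fr", "es"], "score": [6.8, 6.2]})  # hypothetical
b = pd.DataFrame({"country": ["pt", "it"], "score": [5.1, 6.4]})  # hypothetical

stacked = pd.concat([a, b], axis=0)       # rows of b go underneath rows of a
side_by_side = pd.concat([a, b], axis=1)  # columns of b go to the right of a

print(stacked.shape)
print(side_by_side.shape)
print(stacked.index.tolist())  # original row labels are kept, so they repeat
```

The repeated labels in `stacked` are why the index was reset after concatenating above.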

Now let's bring back the unemployment data and connect it with the happiness data.

This time when we combine the data, we are going to be merging them together, aka joining them.

Our two datasets will be the 2010 happiness data and a version of the unemployment data that pulls the seasonally adjusted median unemployment rate for the year 2010 for each country

In [None]:
#Create 2010 seasonally adjusted subset
unemployment2010_sa = unemployment[(unemployment.seasonality == 'sa') & (unemployment.year == 2010)]

In [None]:
unemployment2010_sa.head()

In [None]:
#Group by country derive median
median_unemployment = unemployment2010_sa.groupby("country", as_index=False).unemployment_rate.median()
median_unemployment.head()

`pandas` includes an easy-to-use `.merge()` function. Let's use it to **merge  `median_unemployment` and `happiness2010` using country codes**

In [None]:
merged = pd.merge(happiness2010, median_unemployment, left_on="country_code_name", right_on="country")
merged.head()

Merging is often more complex than this example. If you want to merge on multiple columns, you can pass a list of column names to the `on` parameter.

```
pd.merge(first, second, on=['name', 'id'])
```

The `how` parameter defaults to "inner", which means that the output will only include rows whose key values appear in both of the columns the dataframes are being joined on.
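
Here's a sketch of how `how` changes the result, using two made-up tables that share only some country codes. The `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"country": ["fr", "es", "us"],
                     "happiness": [6.8, 6.2, 7.2]})           # hypothetical
right = pd.DataFrame({"country": ["fr", "es", "pt"],
                      "unemployment_rate": [9.3, 19.9, 12.0]})  # hypothetical

# how="inner" is the default: only countries present in both tables survive
inner = pd.merge(left, right, on="country")

# how="outer" keeps every country and fills the gaps with NaN
outer = pd.merge(left, right, on="country", how="outer", indicator=True)

print(inner["country"].tolist())
print(outer[["country", "_merge"]])
```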

For more information on merging, check the [documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging).

In [None]:
#Num rows can be a good indicator of how well our join did.
merged.shape

Is there a relationship between unemployment rate and other variables?

In [None]:
merged.corr(numeric_only=True)["unemployment_rate"]

We can save this dataset with the `.to_csv()` method

In [None]:
# merged.to_csv("happiness_unemployment.csv")

## 7. Plotting With Pandas  <a id="section7"/>

The best way to get a sense of this data is to **plot it.** Next, we'll start to look at some basic plotting with `pandas`. Before we begin, let's sort the data by country and date. This is good practice and is especially important when using `pandas`'s `.plot()` method because the x-axis values are based on the indices. When we sort, the index values remain unchanged. Thus, we need to reset them. The `drop` parameter tells `pandas` to construct a `DataFrame` *without* adding a column.
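
A short sketch on a made-up three-row table shows what happens to the index when we sort, and what `drop=True` does when we reset it.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["pt", "es", "fr"],
                    "rate": [12.0, 19.9, 9.3]})  # hypothetical data

sorted_toy = toy.sort_values("country")
print(sorted_toy.index.tolist())  # labels travel with their rows

reset = sorted_toy.reset_index(drop=True)
print(reset.index.tolist())       # fresh 0..n-1 labels, old index discarded

kept = sorted_toy.reset_index()   # without drop=True the old index becomes a column
print(list(kept.columns))
```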

In [None]:
unemployment.sort_values(['country', 'year_month'], inplace=True)
unemployment.reset_index(drop=True, inplace=True)

Let's take a look at Spain's unemployment rate (only because it was the highest) across time.

In [None]:
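#.copy() gives us an independent DataFrame, so the assignment below doesn't trigger a SettingWithCopyWarning
spain = unemployment[(unemployment['country'] == 'es') &
                     (unemployment['seasonality'] == 'sa')].copy()

spain["country"] = "Spain"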

In [None]:
spain['unemployment_rate'].plot(figsize=(10, 8), color='#348ABD')

Note that the values along the x-axis represent the indices associated with Spain in the sorted `unemployment` `DataFrame`. Wouldn't it be nice if, instead, we could **show the time period** associated with the various unemployment rates for Spain? It might also be interesting to **compare** Spain's unemployment rate with its neighbor to the west, Portugal.

Let's first create a `DataFrame` that contains the unemployment data for both countries.

In [None]:
#.copy() lets us safely modify and add columns to this subset later on
ps = unemployment[(unemployment['country'].isin(['pt', 'es'])) &
                  (unemployment['seasonality'] == 'sa')].copy()

For quick tasks that involve replacing data values, use a dictionary where the old values are the keys and the new ones are the values.

In [None]:
#Initialize dictionary which we use to turn pt -> Portugal and es -> Spain
country_map = {"pt":"Portugal", "es":"Spain"}

ps["country"] = ps["country"].map(country_map)

Next, we'll **generate time series data** by converting our years and months into `datetime` objects. `pandas` provides a `to_datetime()` function that makes this relatively simple. It converts an argument&mdash;a single value or an array of values&mdash;to `datetime`. (Note that the return value [depends on the input](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html).) If we were interested in March 23, 1868, for example, we could do the following.

In [None]:
pd.to_datetime('1868/3/23')

The argument doesn't necessarily have to be specified in the `yyyy/mm/dd` format. You could list it as `mm/dd/yyyy`, but then it's a good idea to be explicit about the layout by passing a format string.

In [None]:
pd.to_datetime('3/23/1868', format='%m/%d/%Y')

Let's create the `datetime` object and add it to the `DataFrame` as a column named `date`. For this, we'll use the `DataFrame.insert()` method.

In [None]:
ps.insert(loc=0, column='date',
          value=pd.to_datetime(ps['year'].astype(str) + '/' + ps['month'].astype(str) + '/1'))

Finally, let's only keep certain columns, rename them, and reshape the `DataFrame`.

In [None]:
ps = ps[['date', 'country', 'unemployment_rate']]
ps.columns = ['Time Period', 'Country', 'Unemployment Rate']
ps = ps.pivot(index='Time Period', columns='Country', values='Unemployment Rate')
ps.tail()

In [None]:
ps.head()

Notice that the index now holds the dates rather than sequential integers.
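
The reshaping that `.pivot()` performs can be sketched on a made-up "long" table with one row per date and country. The column names `date`, `country`, and `rate` are placeholders for this illustration.

```python
import pandas as pd

long = pd.DataFrame({"date": pd.to_datetime(["2010-01-01", "2010-01-01",
                                             "2010-02-01", "2010-02-01"]),
                     "country": ["Portugal", "Spain", "Portugal", "Spain"],
                     "rate": [10.5, 19.0, 10.6, 19.2]})  # hypothetical data

# One row per date, one column per country, cells filled from "rate"
wide = long.pivot(index="date", columns="country", values="rate")
print(wide)
```

With dates as the index and one column per country, `.plot()` draws one line per column against time.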

In [None]:
ps.plot(figsize=(10, 8), title='Unemployment Rate\n')