Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you'll learn how to perform database-style operations to combine DataFrames. In particular, you'll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.

## Appending DataFrames with ignore_index
In this exercise, you'll use the Baby Names Dataset (from data.gov) again. This time, both DataFrames names_1981 and names_1881 are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame .append() method to make a DataFrame combined_names. To distinguish rows from the original two DataFrames, you'll add a 'year' column to each with the year (1881 or 1981 in this case). In addition, you'll specify ignore_index=True so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled 0, 1, ..., n-1, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
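Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the method call used in this exercise only runs on older versions. The following is a minimal sketch of the same idea using `pd.concat()`. The frames `df_a` and `df_b` are hypothetical toy data (values borrowed from the tables in this exercise), not the full datasets.

```python
import pandas as pd

# Hypothetical toy frames standing in for names_1881 and names_1981
df_a = pd.DataFrame({'name': ['Mary', 'Anna'], 'count': [6919, 2698]})
df_b = pd.DataFrame({'name': ['Jennifer', 'Jessica'], 'count': [57032, 42519]})

# Default behaviour: the original row labels are kept, so 0 and 1 repeat
stacked_keep = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
stacked_new = pd.concat([df_a, df_b], ignore_index=True)

print(stacked_keep.index.tolist())
print(stacked_new.index.tolist())
```

On a recent pandas version, `pd.concat([names_1881, names_1981], ignore_index=True)` should be a drop-in replacement for the `.append()` call.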

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
names_1881 = pd.read_csv('Baby names/names1881.csv', header = None)
names_1881.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Mary | F | 6919 |
| 1 | Anna | F | 2698 |
| 2 | Emma | F | 2034 |
| 3 | Elizabeth | F | 1852 |
| 4 | Margaret | F | 1658 |


In [4]:
names_1881.columns = ['name', 'gender', 'count']
names_1881.head()

|   | name | gender | count |
|---|---|---|---|
| 0 | Mary | F | 6919 |
| 1 | Anna | F | 2698 |
| 2 | Emma | F | 2034 |
| 3 | Elizabeth | F | 1852 |
| 4 | Margaret | F | 1658 |


In [5]:
names_1981 = pd.read_csv('Baby names/names1981.csv', header = None)
names_1981.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Jennifer | F | 57032 |
| 1 | Jessica | F | 42519 |
| 2 | Amanda | F | 34370 |
| 3 | Sarah | F | 28162 |
| 4 | Melissa | F | 28003 |


In [6]:
names_1981.columns = ['name', 'gender', 'count']
names_1981.head()

|   | name | gender | count |
|---|---|---|---|
| 0 | Jennifer | F | 57032 |
| 1 | Jessica | F | 42519 |
| 2 | Amanda | F | 34370 |
| 3 | Sarah | F | 28162 |
| 4 | Melissa | F | 28003 |


__Instructions__
- Create a 'year' column in the DataFrames names_1881 and names_1981, with values of 1881 and 1981 respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
- Create a new DataFrame called combined_names by appending the rows of names_1981 underneath the rows of names_1881. Specify the keyword argument ignore_index=True to make a new RangeIndex of unique integers for each row.
- Print the shapes of all three DataFrames. This has been done for you.
- Extract all rows from combined_names that have the name 'Morgan'. To do this, use the .loc[] accessor with an appropriate filter. The relevant column of combined_names here is 'name'.
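The two pandas idioms mentioned in these instructions, scalar broadcasting and boolean filtering with `.loc[]`, can be sketched on a small example. The DataFrame `toy` below is hypothetical illustration data, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with a few name rows
toy = pd.DataFrame({
    'name': ['Morgan', 'Mary', 'Morgan'],
    'count': [23, 6919, 766],
})

# Assigning a scalar broadcasts it to every row of the new column
toy['year'] = 1881

# A boolean mask inside .loc[] keeps only the rows where the mask is True
morgans = toy.loc[toy['name'] == 'Morgan']

print(morgans)
```

Passing the mask to `.loc[]` directly, rather than wrapping a `.loc[]` column selection in plain square brackets, keeps the row filter in a single accessor call.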

In [7]:
from IPython import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [8]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
names_1981.shape
names_1881.shape
combined_names.shape

# Print all rows that contain the name 'Morgan'
combined_names[combined_names.loc[:, 'name'] == 'Morgan']

(19455, 4)

(1935, 4)

(21390, 4)

|   | name | gender | count | year |
|---|---|---|---|---|
| 1283 | Morgan | M | 23 | 1881 |
| 2096 | Morgan | F | 1769 | 1981 |
| 14390 | Morgan | M | 766 | 1981 |


## Concatenating pandas DataFrames along column axis
The function pd.concat() can concatenate DataFrames horizontally as well as vertically (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument axis=1 or axis='columns'.

In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the two DataFrames horizontally and see that, where rows are missing in the coarser DataFrame, null values (NaN) are inserted in the concatenated DataFrame. This corresponds to an outer join (which you will explore in more detail in later exercises).
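As a minimal sketch of this behaviour, the example below concatenates a coarse and a fine series horizontally. The frames `quarterly` and `monthly` and their column names are hypothetical stand-ins for the weather data.

```python
import pandas as pd

# Hypothetical coarse (2 rows) and fine (4 rows) temperature tables
quarterly = pd.DataFrame({'max_temp': [68, 89]}, index=['Jan', 'Apr'])
monthly = pd.DataFrame({'mean_temp': [32.4, 28.7, 35.0, 53.1]},
                       index=['Jan', 'Feb', 'Mar', 'Apr'])

# axis='columns' (equivalently axis=1) places the frames side by side;
# the default join='outer' keeps the union of the row labels
wide = pd.concat([quarterly, monthly], axis='columns')

# Months missing from the coarse table show up as NaN in its column
print(wide)
print(wide['max_temp'].isna().sum())
```

Because NaN is a floating-point value, the integer column `max_temp` is converted to floats in the result, which is why the maximum temperatures appear as 68.0, 89.0, and so on after concatenation.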

In [16]:
month = ['Jan', 'Apr', 'Jul', 'Oct']
weather_max = pd.DataFrame({'Max TemperatureF':[68, 89, 91, 84]}, index = month)
weather_max.index.name = 'Month'

In [17]:
weather_max

| Month | Max TemperatureF |
|---|---|
| Jan | 68 |
| Apr | 89 |
| Jul | 91 |
| Oct | 84 |


In [18]:
weather_mean = pd.DataFrame({'Mean TemperatureF':[53.1       , 70.        , 34.93548387, 28.71428571, 32.35483871,
       72.87096774, 70.13333333, 35.        , 62.61290323, 39.8       ,
       55.4516129 , 63.76666667]}, index = ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'])
weather_mean.index.name = 'Month'
weather_mean

| Month | Mean TemperatureF |
|---|---|
| Apr | 53.1 |
| Aug | 70.0 |
| Dec | 34.935484 |
| Feb | 28.714286 |
| Jan | 32.354839 |
| Jul | 72.870968 |
| Jun | 70.133333 |
| Mar | 35.0 |
| May | 62.612903 |
| Nov | 39.8 |


In [21]:
import warnings
warnings.filterwarnings('ignore')

In [22]:
# Create a list of weather_max and weather_mean
weather_list = [weather_max, weather_mean]

# Concatenate weather_list horizontally
weather = pd.concat(weather_list, axis = 1)

# Print weather
weather

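|   | Max TemperatureF | Mean TemperatureF |
|---|---|---|
| Apr | 89.0 | 53.1 |
| Aug | NaN | 70.0 |
| Dec | NaN | 34.935484 |
| Feb | NaN | 28.714286 |
| Jan | 68.0 | 32.354839 |
| Jul | 91.0 | 72.870968 |
| Jun | NaN | 70.133333 |
| Mar | NaN | 35.0 |
| May | NaN | 62.612903 |
| Nov | NaN | 39.8 |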


## Reading multiple files to build a DataFrame
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

Here, you'll work with DataFrames compiled from The Guardian's Olympic medal dataset.

pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types, which contains the strings 'bronze', 'silver', and 'gold'.



__Instructions__
- Iterate over medal_types in the for loop.
- Inside the for loop:
  - Create file_name using string interpolation with the loop variable medal. This has been done for you. The expression "%s_top5.csv" % medal evaluates as a string with the value of medal replacing %s in the format string.
  - Create the list of column names called columns. This has been done for you.
  - Read file_name into a DataFrame called medal_df. Specify the keyword arguments header=0, index_col='Country', and names=columns to get the correct row and column Indexes.
  - Append medal_df to medals using the list .append() method.
- Concatenate the list of DataFrames medals horizontally (using axis='columns') to create a single DataFrame called medals. Print it in its entirety.
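The steps above can be tried without any files on disk. The sketch below is an assumption-laden stand-in: `fake_files` is a hypothetical dictionary of in-memory CSV text (with a few values borrowed from the output in this exercise), read through `io.StringIO` in place of real file paths.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for files such as 'bronze_top5.csv'
fake_files = {
    'bronze_top5.csv': "Country,Total\nUnited States,1052\nFrance,475\n",
    'silver_top5.csv': "Country,Total\nUnited States,1195\nItaly,394\n",
}

frames = []
for medal in ['bronze', 'silver']:
    # %-interpolation substitutes the loop variable for %s
    file_name = "%s_top5.csv" % medal
    columns = ['Country', medal]

    # header=0 marks the first line as the existing header row, which
    # names=columns then replaces; index_col picks the row labels
    df = pd.read_csv(io.StringIO(fake_files[file_name]),
                     header=0, index_col='Country', names=columns)
    frames.append(df)

# Side-by-side concatenation aligns rows on the Country index
combined = pd.concat(frames, axis='columns')
print(combined)
```

Renaming the value column to the medal type while loading is what keeps the three columns distinguishable after the horizontal concatenation; otherwise every column would be called `Total`.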

In [44]:
medals = []
medal_types = ['bronze', 'silver', 'gold']

In [45]:
for medal in medal_types:

    # Create the file name: file_name
    file_name = "Summer Olympic medals/%s_top5.csv" % medal
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis='columns')

# Print medals
medals

|   | bronze | silver | gold |
|---|---|---|---|
| France | 475.0 | 461.0 | NaN |
| Germany | 454.0 | NaN | 407.0 |
| Italy | NaN | 394.0 | 460.0 |
| Soviet Union | 584.0 | 627.0 | 838.0 |
| United Kingdom | 505.0 | 591.0 | 498.0 |
| United States | 1052.0 | 1195.0 | 2088.0 |


## Concatenating vertically to get MultiIndexed rows
When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the keys parameter in the call to pd.concat(), which generates a hierarchical index with the labels from keys as the outermost index level. As a result, you don't have to rename the columns of each DataFrame as you load it; only the Index column needs to be specified.
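A minimal sketch of the `keys` parameter on toy data follows. The frames `bronze_toy` and `silver_toy` are hypothetical two-row examples, with values borrowed from the medal tables in this chapter.

```python
import pandas as pd

# Hypothetical two-row frames that share the same column name
countries = pd.Index(['United States', 'Soviet Union'], name='Country')
bronze_toy = pd.DataFrame({'Total': [1052, 584]}, index=countries)
silver_toy = pd.DataFrame({'Total': [1195, 627]}, index=countries)

# keys adds an outer index level that records each row's source frame
stacked = pd.concat([bronze_toy, silver_toy], keys=['bronze', 'silver'])

print(stacked)
print(stacked.index.nlevels)

# A (key, label) tuple addresses a single row
print(stacked.loc[('silver', 'Soviet Union'), 'Total'])
```

The outer level has no name unless you also pass `names=` to `pd.concat()`, which is why it shows up as an unnamed level in the printed output.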

Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset. Once again, pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types, which contains the strings 'bronze', 'silver', and 'gold'.

In [50]:
medals = []

In [51]:
for medal in medal_types:

    file_name = "Summer Olympic medals/%s_top5.csv" % medal

    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])

# Print medals
medals

|   | Country | Total |
|---|---|---|
| bronze | United States | 1052.0 |
| bronze | Soviet Union | 584.0 |
| bronze | United Kingdom | 505.0 |
| bronze | France | 475.0 |
| bronze | Germany | 454.0 |
| silver | United States | 1195.0 |
| silver | Soviet Union | 627.0 |
| silver | United Kingdom | 591.0 |
| silver | France | 461.0 |
| silver | Italy | 394.0 |


## Slicing MultiIndexed DataFrames
This exercise picks up where the last ended (again using The Guardian's Olympic medal dataset).

You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use pd.IndexSlice to extract specific slices.

__Instructions__
- Create a new DataFrame medals_sorted with the entries of medals sorted. Use .sort_index(level=0) to ensure the Index is sorted suitably.
- Print the number of bronze medals won by Germany and all of the silver medal data. This has been done for you.
- Create an alias for pd.IndexSlice called idx. A slicer pd.IndexSlice is required when slicing on the inner level of a MultiIndex.
- Slice all the data on medals won by the United Kingdom. To do this, use the .loc[] accessor with idx[:,'United Kingdom'], :.
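The sort-then-slice pattern from these instructions can be sketched on a small MultiIndexed frame. The DataFrame `toy` is hypothetical and deliberately built in unsorted order; its values are borrowed from the medal tables.

```python
import pandas as pd

# Hypothetical unsorted two-level index: (medal type, country)
index = pd.MultiIndex.from_tuples([
    ('silver', 'United Kingdom'),
    ('bronze', 'United Kingdom'),
    ('bronze', 'Germany'),
    ('silver', 'France'),
])
toy = pd.DataFrame({'Total': [591, 505, 454, 461]}, index=index)

# Sorting the index first lets pandas slice on it reliably
toy_sorted = toy.sort_index(level=0)

# idx[:, 'United Kingdom'] means: every outer label, one inner label
idx = pd.IndexSlice
uk = toy_sorted.loc[idx[:, 'United Kingdom'], :]

print(uk)
```

The trailing `, :` in the `.loc[]` call selects all columns; keeping it explicit makes it unambiguous that the `idx[...]` expression refers to the row index.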

In [52]:
# Sort the entries of medals
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
medals_sorted.loc[('bronze','Germany')]

# Print data about silver medals
medals_sorted.loc['silver']

# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
medals_sorted.loc[idx[:,'United Kingdom'], :]

Total    454.0
Name: (bronze, Germany), dtype: float64

| Country | Total |
|---|---|
| France | 461.0 |
| Italy | 394.0 |
| Soviet Union | 627.0 |
| United Kingdom | 591.0 |
| United States | 1195.0 |


|   | Country | Total |
|---|---|---|
| bronze | United Kingdom | 505.0 |
| gold | United Kingdom | 498.0 |
| silver | United Kingdom | 591.0 |


---
__Great work! It looks like only the United States and the Soviet Union have won more Silver medals than the United Kingdom.__

## Concatenating DataFrames with inner join
Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset.

The DataFrames bronze, silver, and gold have been pre-loaded for you.

Your task is to compute an inner join.

In [57]:
gold = pd.read_csv('Summer Olympic medals/Gold.csv', index_col = 'Country')
gold.drop('NOC', axis = 1, inplace = True)
gold.sort_values('Total', ascending = False, inplace = True)

In [58]:
gold= gold.head(5)

In [59]:
gold

| Country | Total |
|---|---|
| United States | 2088.0 |
| Soviet Union | 838.0 |
| United Kingdom | 498.0 |
| Italy | 460.0 |
| Germany | 407.0 |


In [60]:
silver = pd.read_csv('Summer Olympic medals/Silver.csv', index_col = 'Country')
silver.drop('NOC', axis = 1, inplace = True)
silver.sort_values('Total', ascending = False, inplace = True)
silver= silver.head(5)
silver

| Country | Total |
|---|---|
| United States | 1195.0 |
| Soviet Union | 627.0 |
| United Kingdom | 591.0 |
| France | 461.0 |
| Italy | 394.0 |


In [61]:
bronze = pd.read_csv('Summer Olympic medals/Bronze.csv', index_col = 'Country')
bronze.drop('NOC', axis = 1, inplace = True)
bronze.sort_values('Total', ascending = False, inplace = True)
bronze= bronze.head(5)
bronze

| Country | Total |
|---|---|
| United States | 1052.0 |
| Soviet Union | 584.0 |
| United Kingdom | 505.0 |
| France | 475.0 |
| Germany | 454.0 |


__Instructions__
- Construct a list of DataFrames called medal_list with entries bronze, silver, and gold.
- Concatenate medal_list horizontally with an inner join to create medals.
- Use the keyword argument keys=['bronze', 'silver', 'gold'] to yield suitable hierarchical indexing.
- Use axis=1 to get horizontal concatenation.
- Use join='inner' to keep only rows that share common index labels.
- Print the new DataFrame medals.
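To see what `join='inner'` changes, it helps to compare it with the default outer join on a tiny example. The frames `bronze_toy` and `gold_toy` below are hypothetical two-row tables with values borrowed from this exercise.

```python
import pandas as pd

# Hypothetical frames that share only one row label ('United States')
bronze_toy = pd.DataFrame({'Total': [1052, 475]},
                          index=['United States', 'France'])
gold_toy = pd.DataFrame({'Total': [2088, 460]},
                        index=['United States', 'Italy'])

frames = [bronze_toy, gold_toy]
keys = ['bronze', 'gold']

# Inner join: keep only row labels present in every frame
inner = pd.concat(frames, keys=keys, axis=1, join='inner')

# Outer join (the default): keep the union of row labels, filling with NaN
outer = pd.concat(frames, keys=keys, axis=1, join='outer')

print(inner)
print(outer)
```

With `axis=1`, the `keys` become the outer level of the column MultiIndex, so a single value is addressed with a tuple such as `inner.loc['United States', ('gold', 'Total')]`.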

In [63]:
# Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')

# Print medals
medals

| Country | (bronze, Total) | (silver, Total) | (gold, Total) |
|---|---|---|---|
| United States | 1052.0 | 1195.0 | 2088.0 |
| Soviet Union | 584.0 | 627.0 | 838.0 |
| United Kingdom | 505.0 | 591.0 | 498.0 |


---
__Well done! France, Italy, and Germany got dropped as part of the join since they are not present in each of bronze, silver, and gold. Therefore, the final DataFrame has only the United States, Soviet Union, and United Kingdom.__