In this chapter, you'll learn different techniques for importing multiple files into DataFrames. Having imported your data into individual DataFrames, you'll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential for merging DataFrames later in the course.

## Reading DataFrames from multiple files
When data is spread among several files, you usually invoke pandas' read_csv() (or a similar data import function) multiple times to load the data into several DataFrames.

The data files for this example have been derived from a list of Olympic medals awarded between 1896 & 2008 compiled by the Guardian.

The column labels of each DataFrame are NOC, Country, & Total where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won (bronze, silver, or gold).

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from IPython import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [3]:
# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv('Summer Olympic medals/Bronze.csv')
bronze.head()
# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv('Summer Olympic medals/Silver.csv')
silver.head()
# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv('Summer Olympic medals/Gold.csv')
gold.head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,1052.0
1,URS,Soviet Union,584.0
2,GBR,United Kingdom,505.0
3,FRA,France,475.0
4,GER,Germany,454.0


Unnamed: 0,NOC,Country,Total
0,USA,United States,1195.0
1,URS,Soviet Union,627.0
2,GBR,United Kingdom,591.0
3,FRA,France,461.0
4,GER,Germany,350.0


Unnamed: 0,NOC,Country,Total
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


## Reading DataFrames from multiple files in a loop
Loading data from multiple files into DataFrames is more concise, and easier to extend to additional files, when done in a loop or a list comprehension.

Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.

In [4]:
# Create the list of file names: filenames
filenames = ['Summer Olympic medals/Gold.csv', 'Summer Olympic medals/Silver.csv', 'Summer Olympic medals/Bronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

# Print the top 5 rows of each DataFrame in dataframes
dataframes[0].head()
dataframes[1].head()
dataframes[2].head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


Unnamed: 0,NOC,Country,Total
0,USA,United States,1195.0
1,URS,Soviet Union,627.0
2,GBR,United Kingdom,591.0
3,FRA,France,461.0
4,GER,Germany,350.0


Unnamed: 0,NOC,Country,Total
0,USA,United States,1052.0
1,URS,Soviet Union,584.0
2,GBR,United Kingdom,505.0
3,FRA,France,475.0
4,GER,Germany,454.0
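
The loop above can also be written as a list comprehension. The following is a minimal sketch, not part of the original exercise: it writes three small stand-in CSV files to a temporary directory so that it runs on its own, and the values and the names `tmp_dir` and `toy` are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in files (hypothetical values, not the real medal counts)
tmp_dir = Path(tempfile.mkdtemp())
for medal, totals in [('Gold', [3, 1]), ('Silver', [2, 2]), ('Bronze', [1, 4])]:
    toy = pd.DataFrame({'NOC': ['AAA', 'BBB'],
                        'Country': ['Country A', 'Country B'],
                        'Total': totals})
    toy.to_csv(tmp_dir / f'{medal}.csv', index=False)

# Create the list of file names: filenames
filenames = [tmp_dir / f'{medal}.csv' for medal in ('Gold', 'Silver', 'Bronze')]

# Build one DataFrame per file with a list comprehension
dataframes = [pd.read_csv(filename) for filename in filenames]

print(dataframes[0])
```

The same pattern should carry over to other import functions by swapping `pd.read_csv` for the appropriate reader.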


## Combining DataFrames from multiple data files
In this exercise, you'll combine the three DataFrames from earlier exercises - gold, silver, & bronze - into a single DataFrame called medals. The approach you'll use here is clumsy. Later on in the course, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.

Remember, the column labels of each DataFrame are NOC, Country, and Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won.

In [5]:
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels

# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']

# Print the head of medals
medals.head()

Unnamed: 0,NOC,Country,Gold,Silver,Bronze
0,USA,United States,2088.0,1195.0,1052.0
1,URS,Soviet Union,838.0,627.0,584.0
2,GBR,United Kingdom,498.0,591.0,505.0
3,FRA,France,378.0,461.0,475.0
4,GER,Germany,407.0,350.0,454.0
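
One reason the approach above is fragile is that column assignment aligns rows by index label. With the default integer index, that only gives the right answer if every file lists the countries in the same order. The sketch below uses made-up toy data (not the real medal counts) to show how setting `NOC` as the index lets pandas match rows by country code instead.

```python
import pandas as pd

# Toy data (hypothetical values); note the rows are in different orders
gold = pd.DataFrame({'NOC': ['AAA', 'BBB', 'CCC'], 'Total': [5, 3, 1]})
silver = pd.DataFrame({'NOC': ['CCC', 'AAA', 'BBB'], 'Total': [7, 4, 2]})

# Use the country code as the row label in both DataFrames
gold = gold.set_index('NOC')
silver = silver.set_index('NOC')

# Start from the gold counts and give the column a descriptive name
medals = gold.rename(columns={'Total': 'Gold'})

# Assignment aligns on index labels, so each NOC receives its own silver count
medals['Silver'] = silver['Total']

print(medals)
```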


## Reindexing using another DataFrame Index
Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame .reindex() method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its .index attribute.

The Baby Names Dataset from data.gov summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames names_1981 and names_1881 loaded for you.

The DataFrames names_1981 and names_1881 both have a MultiIndex with levels name and gender giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, names_1981 and names_1881 were read in using the following commands:

names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))

As you can see from their shapes, which are printed further below, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 compared to 1881.

Your job here is to use the DataFrame .reindex() and .dropna() methods to make a DataFrame common_names counting names from 1881 that were still popular in 1981.

In [7]:
names_1881 = pd.read_csv('Baby names/names1881.csv', header = None)
names_1881.head()

Unnamed: 0,0,1,2
0,Mary,F,6919
1,Anna,F,2698
2,Emma,F,2034
3,Elizabeth,F,1852
4,Margaret,F,1658


In [8]:
names_1881.columns = ['name', 'gender', 'count']
names_1881.set_index(['name', 'gender'], inplace = True)
names_1881.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Mary,F,6919
Anna,F,2698
Emma,F,2034
Elizabeth,F,1852
Margaret,F,1658


In [9]:
names_1981 = pd.read_csv('Baby names/names1981.csv', header = None)
names_1981.head()

Unnamed: 0,0,1,2
0,Jennifer,F,57032
1,Jessica,F,42519
2,Amanda,F,34370
3,Sarah,F,28162
4,Melissa,F,28003


In [10]:
names_1981.columns = ['name', 'gender', 'count']
names_1981.set_index(['name', 'gender'], inplace = True)
names_1981.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Jennifer,F,57032
Jessica,F,42519
Amanda,F,34370
Sarah,F,28162
Melissa,F,28003


In [11]:
names_1881.shape
names_1981.shape

(1935, 1)

(19455, 1)

In [12]:
# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
common_names.shape

# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
common_names.shape

(1935, 1)

(1587, 1)
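
The shapes above follow from how `.reindex()` works: the result has one row per label in the index passed in, and labels missing from the original DataFrame are filled with NaN, which `.dropna()` then removes. Here is a small self-contained sketch with made-up names and counts; the helper `read_names` and the variables `names_old` and `names_new` are hypothetical stand-ins for the real files.

```python
import io

import pandas as pd

# Toy stand-ins for the two files (hypothetical counts)
csv_old = "Mary,F,10\nAnna,F,8\nJohn,M,9\n"
csv_new = "Mary,F,5\nJohn,M,7\nJennifer,F,20\n"

def read_names(text):
    # The data has no header row; the first two columns form the MultiIndex
    return pd.read_csv(io.StringIO(text), header=None,
                       names=['name', 'gender', 'count'], index_col=[0, 1])

names_old = read_names(csv_old)
names_new = read_names(csv_new)

# Look up the rows of names_new by the labels of names_old;
# labels that names_new lacks (here Anna, F) become NaN
reindexed = names_new.reindex(names_old.index)

# Keep only the labels present in both DataFrames
common = reindexed.dropna()

print(reindexed.shape, common.shape)
```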

## Broadcasting in arithmetic formulas
In this exercise, you'll work with weather data pulled from wunderground.com. The DataFrame weather has been pre-loaded along with pandas as pd. It has 365 rows (observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.

You'll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.

Remember, ordinary arithmetic operators (like +, -, *, and /) broadcast scalar values to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions. Broadcasting also works with pandas Series and NumPy arrays.
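
As a quick illustration of these broadcasting rules, the sketch below uses a tiny made-up DataFrame (the values are chosen for easy arithmetic and are not taken from the weather data). It combines the DataFrame with a scalar, with a NumPy array holding one entry per column, and with a Series labeled by column name.

```python
import numpy as np
import pandas as pd

# Toy temperatures in degrees Fahrenheit (hypothetical values)
temps_f = pd.DataFrame({'Min TemperatureF': [32.0, 50.0],
                        'Max TemperatureF': [41.0, 212.0]})

# Scalars are broadcast to every element of the DataFrame
temps_c = (temps_f - 32) * 5 / 9

# A NumPy array with one entry per column is broadcast down the rows
offsets = np.array([1.0, 2.0])
shifted = temps_f + offsets

# A Series is aligned with the column labels, then broadcast down the rows
freezing = pd.Series({'Min TemperatureF': 32.0, 'Max TemperatureF': 32.0})
above_freezing = temps_f - freezing

print(temps_c)
```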

In [14]:
weather_data = pd.read_csv('pittsburgh2013.csv', parse_dates = True, index_col = 'Date')
weather_data.head()

Unnamed: 0_level_0,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,Max Sea Level PressureIn,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2013-01-01,32,28,21,30,27,16,100,89,77,30.1,...,10,6,2,10,8,,0.0,8,Snow,277
2013-01-02,25,21,17,14,12,10,77,67,55,30.27,...,10,10,10,14,5,,0.0,4,,272
2013-01-03,32,24,16,19,15,9,77,67,56,30.25,...,10,10,10,17,8,26.0,0.0,3,,229
2013-01-04,30,28,27,21,19,17,75,68,59,30.28,...,10,10,6,23,16,32.0,0.0,4,,250
2013-01-05,34,30,25,23,20,16,75,68,61,30.42,...,10,10,10,16,10,23.0,0.21,5,,221


In [15]:
# Extract selected columns from weather_data as new DataFrame: temps_f
temps_f = weather_data[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F', 'C')

# Print first 5 rows of temps_c
temps_c.head()

Unnamed: 0_level_0,Min TemperatureC,Mean TemperatureC,Max TemperatureC
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013-01-01,-6.111111,-2.222222,0.0
2013-01-02,-8.333333,-6.111111,-3.888889
2013-01-03,-8.888889,-4.444444,0.0
2013-01-04,-2.777778,-2.222222,-1.111111
2013-01-05,-3.888889,-1.111111,1.111111


## Computing percentage growth of GDP
Your job in this exercise is to compute the yearly percent-change of US GDP (Gross Domestic Product) since 2008.

The data has been obtained from the Federal Reserve Bank of St. Louis and is available in the file GDP.csv, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP. 

__Instructions__
- Read the file 'GDP.csv' into a DataFrame called gdp. Use parse_dates=True and index_col='DATE'.
- Create a DataFrame post2008 by slicing gdp such that it comprises all rows from 2008 onward.
- Print the last 8 rows of the slice post2008. This data has quarterly frequency, so the indices are separated by three-month intervals.
- Create the DataFrame yearly by resampling the slice post2008 by year. Remember, you need to chain .resample() (using the alias 'A' for annual frequency) with some kind of aggregation; you will use the aggregation method .last() to select the last element when resampling.
- Compute the percentage growth of the resampled DataFrame yearly with .pct_change() * 100.
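
The steps above can be sketched on a small made-up quarterly series (the numbers are illustrative, not real GDP figures). One assumption worth flagging: pandas 2.2 and later deprecate the `'A'` alias in favor of `'YE'`, so this sketch tries `'YE'` first and falls back to `'A'` on older versions.

```python
import pandas as pd

# Toy quarterly series (hypothetical values, not real GDP figures)
dates = pd.to_datetime(['2008-01-01', '2008-04-01', '2008-07-01', '2008-10-01',
                        '2009-01-01', '2009-04-01', '2009-07-01', '2009-10-01'])
quarterly = pd.DataFrame({'VALUE': [100.0, 102.0, 104.0, 110.0,
                                    112.0, 115.0, 118.0, 121.0]}, index=dates)

# Resample to annual frequency, keeping the last quarter of each year
try:
    yearly = quarterly.resample('YE').last()   # newer pandas alias
except ValueError:
    yearly = quarterly.resample('A').last()    # older pandas alias

# Percentage change from one year-end value to the next
yearly['growth'] = yearly['VALUE'].pct_change() * 100

print(yearly)
```

The first `growth` entry is NaN because there is no earlier year to compare against.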


In [16]:
gdp_usa = pd.read_csv('GDP/gdp_usa.csv', parse_dates = True , index_col = 'DATE')
gdp_usa.head()

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
1947-01-01,243.1
1947-04-01,246.3
1947-07-01,250.1
1947-10-01,260.3
1948-01-01,266.2


In [17]:
# Slice all the gdp_usa data from 2008 onward: post2008
post2008 = gdp_usa.loc['2008':]

# Print the last 8 rows of post2008
post2008.tail(8)

# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()

# Print yearly
yearly

# Compute percentage growth of the 'VALUE' column: yearly['growth']
yearly['growth'] = yearly['VALUE'].pct_change() * 100

# Print yearly again
yearly

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
2014-07-01,17569.4
2014-10-01,17692.2
2015-01-01,17783.6
2015-04-01,17998.3
2015-07-01,18141.9
2015-10-01,18222.8
2016-01-01,18281.6
2016-04-01,18436.5


Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
2008-12-31,14549.9
2009-12-31,14566.5
2010-12-31,15230.2
2011-12-31,15785.3
2012-12-31,16297.3
2013-12-31,16999.9
2014-12-31,17692.2
2015-12-31,18222.8
2016-12-31,18436.5


Unnamed: 0_level_0,VALUE,growth
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1
2008-12-31,14549.9,
2009-12-31,14566.5,0.11409
2010-12-31,15230.2,4.556345
2011-12-31,15785.3,3.644732
2012-12-31,16297.3,3.243524
2013-12-31,16999.9,4.311144
2014-12-31,17692.2,4.072377
2015-12-31,18222.8,2.999062
2016-12-31,18436.5,1.172707


## Converting currency of stocks
In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. The files sp500.csv for sp500 and exchange.csv for the exchange rates are both provided to you.

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

__Instructions__
- Read the DataFrames sp500 & exchange from the files 'sp500.csv' & 'exchange.csv' respectively.
- Use parse_dates=True and index_col='Date'.
- Extract the columns 'Open' & 'Close' from the DataFrame sp500 as a new DataFrame dollars and print the first 5 rows.
- Construct a new DataFrame pounds by converting US dollars to British pounds. You'll use the .multiply() method of dollars with exchange['GBP/USD'] and axis='rows'.
- Print the first 5 rows of the new DataFrame pounds.
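
To see what `axis='rows'` does, here is a minimal sketch with made-up prices and exchange rates (not the real data). The Series of rates is aligned with the row index of `dollars`, so each row is multiplied by the rate for its own date.

```python
import pandas as pd

# Toy prices and exchange rates (hypothetical values)
dates = pd.to_datetime(['2015-01-02', '2015-01-05'])
dollars = pd.DataFrame({'Open': [100.0, 200.0],
                        'Close': [110.0, 190.0]}, index=dates)
rate = pd.Series([0.5, 0.25], index=dates, name='GBP/USD')

# Match the Series to the row labels (dates) and multiply across each row
pounds = dollars.multiply(rate, axis='rows')

print(pounds)
```

Without `axis='rows'`, pandas would try to align the Series labels with the column labels of `dollars` rather than with its dates.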

In [18]:
# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv('sp500.csv', parse_dates = True, index_col = 'Date')

# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv('exchange.csv', parse_dates = True, index_col = 'Date')

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]

# Print the head of dollars
dollars.head()

# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis = 'rows')

# Print the head of pounds
pounds.head()

Unnamed: 0_level_0,Open,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-02,2058.899902,2058.199951
2015-01-05,2054.439941,2020.579956
2015-01-06,2022.150024,2002.609985
2015-01-07,2005.550049,2025.900024
2015-01-08,2030.609985,2062.139893


Unnamed: 0_level_0,Open,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-02,1340.364425,1339.90875
2015-01-05,1348.616555,1326.389506
2015-01-06,1332.51598,1319.639876
2015-01-07,1330.562125,1344.063112
2015-01-08,1343.268811,1364.126161
