## Reading DataFrames from multiple files

When data is spread among several files, you usually invoke pandas' read_csv() (or a similar data import function) multiple times to load the data into several DataFrames.

The data files for this example are derived from a list of Olympic medals awarded between 1896 and 2008, compiled by The Guardian.

The column labels of each DataFrame are NOC, Country, and Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won (bronze, silver, or gold).



In [4]:
# Import pandas
import pandas as pd

# Directory that holds the course data files
path = 'K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\'

# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(path + 'Bronze.csv')

# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(path + 'Silver.csv')

# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv(path + 'Gold.csv')

# Print the first five rows of gold
print(gold.head())

   NOC         Country   Total
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0


## Reading DataFrames from multiple files in a loop
As you saw in the video, loading data from multiple files into DataFrames is more concise in a loop or a list comprehension than in one hand-written read_csv() call per file.

Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.

Here, you'll continue working with The Guardian's Olympic medal dataset.

In [3]:
# Create the list of file names: filenames
filenames = [path + 'Gold.csv', path + 'Silver.csv', path + 'Bronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

# Print top 5 rows of 1st DataFrame in dataframes
print(dataframes[0].head())


   NOC         Country   Total
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0


## Combining DataFrames from multiple data files
In this exercise, you'll combine the three DataFrames from earlier exercises (gold, silver, and bronze) into a single DataFrame called medals. The approach you'll use here is clumsy. Later on in the course, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.

Remember, the column labels of each DataFrame are NOC, Country, and Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won.

In [4]:
# Make a copy of gold: medals
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels

# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total'] 
medals['Bronze'] = bronze['Total']

# Print the head of medals
print(medals.head())

   NOC         Country    Gold  Silver  Bronze
0  USA   United States  2088.0  1195.0  1052.0
1  URS    Soviet Union   838.0   627.0   584.0
2  GBR  United Kingdom   498.0   591.0   505.0
3  FRA          France   378.0   461.0   475.0
4  GER         Germany   407.0   350.0   454.0


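## Using a list comprehension

In [5]:
# Read every file in a list comprehension and stack the results into one DataFrame
dataframe2 = pd.concat([pd.read_csv(f) for f in filenames])
# dataframes is a list of DataFrames, not a DataFrame, so .equals() returns False
dataframe2.equals(dataframes)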




False
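
The `False` above comes from comparing a single concatenated DataFrame with a list. A minimal sketch of a like-for-like comparison follows; the in-memory CSV strings are hypothetical stand-ins for the medal files (only the USA totals are taken from the outputs above), so it runs without the course data.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the three medal files
sources = {
    'gold': "NOC,Country,Total\nUSA,United States,2088.0\n",
    'silver': "NOC,Country,Total\nUSA,United States,1195.0\n",
    'bronze': "NOC,Country,Total\nUSA,United States,1052.0\n",
}

# Loop version: build the list one DataFrame at a time
frames_loop = []
for text in sources.values():
    frames_loop.append(pd.read_csv(io.StringIO(text)))

# List-comprehension version: same result in one expression
frames_comp = [pd.read_csv(io.StringIO(text)) for text in sources.values()]

# Compare the two lists element by element
same = all(a.equals(b) for a, b in zip(frames_loop, frames_comp))
print(same)

# A concatenated DataFrame is a different kind of object from a list
stacked = pd.concat(frames_comp)
print(stacked.equals(frames_comp))
```

Comparing element by element with `zip` keeps the check meaningful even when the list holds many DataFrames.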

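## Using the glob module

In [6]:
from glob import glob
# Collect every path that matches the wildcard pattern: sales_files
# (stored under its own name, so filenames below is unchanged)
sales_files = glob('sales*.csv')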

In [7]:
filenames

['K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\Gold.csv',
 'K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\Silver.csv',
 'K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\Bronze.csv']

In [8]:
globadf = pd.concat([pd.read_csv(f) for f in filenames]) 

In [9]:
type(globadf)


pandas.core.frame.DataFrame
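
To see glob and pd.concat() working together end to end, here is a small sketch that writes its own files first. The temporary directory, the file names, and the unit counts are all made up for illustration.

```python
import os
import tempfile
from glob import glob

import pandas as pd

# Write three small, made-up sales files into a temporary directory
tmpdir = tempfile.mkdtemp()
for month, units in [('jan', 3), ('feb', 5), ('mar', 7)]:
    pd.DataFrame({'Units': [units]}).to_csv(
        os.path.join(tmpdir, 'sales-%s.csv' % month), index=False)

# glob returns the matching paths in no guaranteed order, so sort them
sales_files = sorted(glob(os.path.join(tmpdir, 'sales*.csv')))

# Read every match and stack the results into one DataFrame
sales = pd.concat([pd.read_csv(f) for f in sales_files], ignore_index=True)
print(sales)
```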

## Sorting a DataFrame by its Index and columns

It is often useful to rearrange the sequence of the rows of a DataFrame by sorting. You don't have to implement these yourself; the principal methods for doing this are .sort_index() and .sort_values().

In this exercise, you'll use these methods with a DataFrame of temperature values indexed by month names. You'll sort the rows alphabetically using the Index and numerically using a column. Notice, for this data, the original ordering is probably most useful and intuitive: the purpose here is for you to understand what the sorting methods do.


In [10]:
# Import pandas
import pandas as pd

# Read 'maxtemperture.csv' (monthly maximum temperatures) into a DataFrame: weather1
weather1 = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\maxtemperture.csv',index_col='Month')

# Print the head of weather1
print(weather1.head())

# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
print(weather2.head())

# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)

# Print the head of weather3
print(weather3.head())

# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values('Max TemperatureF')

# Print the head of weather4
print(weather4.head())

        Max TemperatureF
Month                   
Jan                   68
Feb                   60
Mar                   68
Apr                   84
May                   88
           Max TemperatureF
Month                      
Apr                      84
Aug                      86
Dec                      68
Feb                      60
Jan                      68
             Max TemperatureF
Month                        
Sep                        90
Oct                        84
Nov                        72
May                        88
Mar                        68
       Max TemperatureF
Month                  
Feb                  60
Jan                  68
Mar                  68
Dec                  68
Nov                  72


## Reindexing a DataFrame from a list
Sorting methods are not the only way to change DataFrame Indexes. There is also the .reindex() method.

In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of upsampling or increasing the rate of samples, which you may recall from the pandas Foundations course).

The original data has the first month's abbreviation of the quarter (three-month interval) on the Index, namely Apr, Jan, Jul, and Oct. This data has been loaded into a DataFrame called weather1 and has been printed in its entirety in the IPython Shell. Notice it has only four rows (corresponding to the first month of each quarter) and that the rows are not sorted chronologically.

You'll initially use a list of all twelve month abbreviations and subsequently apply the .ffill() method to forward-fill the null entries when upsampling. This list of month abbreviations has been pre-loaded as year.

In [11]:
# Import pandas
import pandas as pd
import numpy as np
weather1  = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\temperture.csv',index_col='Month')



year = ['Jan',
        'Feb',
        'Mar',
        'Apr',
        'May',
        'Jun',
        'Jul',
        'Aug',
        'Sep',
        'Oct',
        'Nov',
        'Dec']



# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
print(weather2)

# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
print(weather3)


       Mean TemperatureF
Month                   
Jan            32.133333
Feb                  NaN
Mar                  NaN
Apr            61.956044
May                  NaN
Jun                  NaN
Jul            68.934783
Aug                  NaN
Sep                  NaN
Oct            43.434783
Nov                  NaN
Dec                  NaN
       Mean TemperatureF
Month                   
Jan            32.133333
Feb            32.133333
Mar            32.133333
Apr            61.956044
May            61.956044
Jun            61.956044
Jul            68.934783
Aug            68.934783
Sep            68.934783
Oct            43.434783
Nov            43.434783
Dec            43.434783
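
The same two steps can be reproduced without the CSV file. This sketch rebuilds the four quarterly rows by hand from the values printed above; the names `quarterly`, `months`, `monthly`, and `filled` are chosen here for illustration.

```python
import pandas as pd

# Quarterly values copied from the output above, in the file's unsorted order
quarterly = pd.DataFrame(
    {'Mean TemperatureF': [61.956044, 32.133333, 68.934783, 43.434783]},
    index=pd.Index(['Apr', 'Jan', 'Jul', 'Oct'], name='Month'))

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Labels missing from the original index become NaN rows
monthly = quarterly.reindex(months)

# Forward-fill copies the last valid value down into each gap
filled = monthly.ffill()
print(filled)
```

Because the fill runs down the rows in their new order, each quarter's value carries into the two months that follow it.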


In [12]:
weather1.head()

       Mean TemperatureF
Month                   
Apr            61.956044
Jan            32.133333
Jul            68.934783
Oct            43.434783


## Reindexing using another DataFrame Index

Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame .reindex() method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its .index attribute.

The Baby Names Dataset from data.gov summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames names_1981 and names_1881 loaded for you.

The DataFrames names_1981 and names_1881 both have a MultiIndex with levels name and gender giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, names_1981 and names_1881 were read in using the following commands:

names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))

As you can see by looking at their shapes, which have been printed in the IPython Shell, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.

Your job here is to use the DataFrame .reindex() and .dropna() methods to make a DataFrame common_names counting names from 1881 that were still popular in 1981.

In [13]:
#Init

names_1981 = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))

names_1881 = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))


In [14]:
# Import pandas
import pandas as pd

# Reindex names_1981 with index of names_1881: common_names
#Create a new DataFrame common_names by reindexing names_1981 using the Index of the DataFrame names_1881 of older names.
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
print(common_names.shape)

# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
print(common_names.shape)


(1935, 1)
(1587, 1)
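
The shape drops from 1935 to 1587 rows because labels missing from names_1981 come back as NaN and are then removed. A small sketch of the same mechanism, using made-up names and counts rather than the real dataset:

```python
import pandas as pd

# Made-up counts, indexed by (name, gender) like the baby-names data
old = pd.DataFrame(
    {'count': [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [('Mary', 'F'), ('John', 'M'), ('Minnie', 'F')],
        names=['name', 'gender']))
new = pd.DataFrame(
    {'count': [500, 900, 700]},
    index=pd.MultiIndex.from_tuples(
        [('John', 'M'), ('Mary', 'F'), ('Jennifer', 'F')],
        names=['name', 'gender']))

# Look up the old labels in the new data; labels absent from `new` give NaN
aligned = new.reindex(old.index)
print(aligned.shape)

# Dropping the NaN rows leaves only the labels present in both
common = aligned.dropna()
print(common)
```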


## Broadcasting in arithmetic formulas
In this exercise, you'll work with weather data pulled from wunderground.com. The DataFrame weather has been pre-loaded along with pandas as pd. It has 365 rows (observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.

You'll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.

Remember, ordinary arithmetic operators (like +, -, *, and /) broadcast scalar values to conforming DataFrames when combining scalars and DataFrames in arithmetic expressions. Broadcasting also works with pandas Series and NumPy arrays.
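
As a rough illustration of the three cases named above (scalar, NumPy array, and Series), the sketch below uses a tiny hand-made DataFrame. The temperatures, `offsets`, and `scale` are invented for the example.

```python
import numpy as np
import pandas as pd

temps_f = pd.DataFrame({'Min TemperatureF': [32.0, 50.0, 14.0],
                        'Max TemperatureF': [68.0, 212.0, 104.0]})

# Scalar: applied to every element
temps_c = (temps_f - 32) * 5 / 9

# NumPy array with one entry per column: repeated for every row
offsets = np.array([1.0, 2.0])
shifted = temps_f + offsets

# Series: aligned on the column labels by default
scale = pd.Series({'Min TemperatureF': 10.0, 'Max TemperatureF': 100.0})
scaled = temps_f / scale

print(temps_c, shifted, scaled, sep='\n')
```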

In [15]:
## init
import pandas as pd

weather = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\pittsburgh2013.csv', index_col='Date')

In [16]:
print(weather.head())

          Max TemperatureF  Mean TemperatureF  Min TemperatureF  \
Date                                                              
2013-1-1                32                 28                21   
2013-1-2                25                 21                17   
2013-1-3                32                 24                16   
2013-1-4                30                 28                27   
2013-1-5                34                 30                25   

          Max Dew PointF  MeanDew PointF  Min DewpointF  Max Humidity  \
Date                                                                    
2013-1-1              30              27             16           100   
2013-1-2              14              12             10            77   
2013-1-3              19              15              9            77   
2013-1-4              21              19             17            75   
2013-1-5              23              20             16            75   

          Mean Hum

In [17]:
# Exercise

# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF','Mean TemperatureF','Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F', 'C')

# Print first 5 rows of temps_c
print(temps_c.head())

          Min TemperatureC  Mean TemperatureC  Max TemperatureC
Date                                                           
2013-1-1         -6.111111          -2.222222          0.000000
2013-1-2         -8.333333          -6.111111         -3.888889
2013-1-3         -8.888889          -4.444444          0.000000
2013-1-4         -2.777778          -2.222222         -1.111111
2013-1-5         -3.888889          -1.111111          1.111111


## Computing percentage growth of GDP
Your job in this exercise is to compute the yearly percent-change of US GDP (Gross Domestic Product) since 2008.

The data has been obtained from the Federal Reserve Bank of St. Louis and is available in the file GDP.csv, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP. For a refresher on resampling, check out the relevant material from pandas Foundations.

In [18]:
import pandas as pd

# Read the GDP data (saved locally as 'gdp_usa.csv') into a DataFrame: gdp
gdp = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\gdp_usa.csv', parse_dates=True, index_col='DATE')


In [19]:
gdp.head()

            VALUE
DATE             
1947-01-01  243.1
1947-04-01  246.3
1947-07-01  250.1
1947-10-01  260.3
1948-01-01  266.2


In [20]:

# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008':]

# Print the last 8 rows of post2008
print(post2008.tail(8))

# Resample post2008 by year, keeping last(): yearly
# ('A' is the annual alias used here; pandas 2.2 and later prefer 'YE')
yearly = post2008.resample('A').last()

# Print yearly
print(yearly)

# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly['VALUE'].pct_change() * 100

# Print yearly again
print(yearly)

              VALUE
DATE               
2014-07-01  17569.4
2014-10-01  17692.2
2015-01-01  17783.6
2015-04-01  17998.3
2015-07-01  18141.9
2015-10-01  18222.8
2016-01-01  18281.6
2016-04-01  18436.5
              VALUE
DATE               
2008-12-31  14549.9
2009-12-31  14566.5
2010-12-31  15230.2
2011-12-31  15785.3
2012-12-31  16297.3
2013-12-31  16999.9
2014-12-31  17692.2
2015-12-31  18222.8
2016-12-31  18436.5
              VALUE    growth
DATE                         
2008-12-31  14549.9       NaN
2009-12-31  14566.5  0.114090
2010-12-31  15230.2  4.556345
2011-12-31  15785.3  3.644732
2012-12-31  16297.3  3.243524
2013-12-31  16999.9  4.311144
2014-12-31  17692.2  4.072377
2015-12-31  18222.8  2.999062
2016-12-31  18436.5  1.172707
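
The last two growth figures can be checked by hand with the eight quarterly values printed above. This sketch swaps .resample() for a groupby on the calendar year, an assumption made here so that it does not depend on which frequency alias the installed pandas accepts. The index labels are therefore plain years rather than year-end dates.

```python
import pandas as pd

# The last eight quarterly observations, copied from the output above
quarterly = pd.Series(
    [17569.4, 17692.2, 17783.6, 17998.3, 18141.9, 18222.8, 18281.6, 18436.5],
    index=pd.to_datetime(['2014-07-01', '2014-10-01', '2015-01-01', '2015-04-01',
                          '2015-07-01', '2015-10-01', '2016-01-01', '2016-04-01']),
    name='VALUE')

# Keep the last observation in each calendar year
yearly = quarterly.groupby(quarterly.index.year).last()

# Percent change relative to the previous year; the first year has no predecessor
growth = yearly.pct_change() * 100
print(growth.round(6))
```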


## Converting currency of stocks

In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. The files sp500.csv (the index prices) and exchange.csv (the exchange rates) are both provided to you.

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

In [21]:
# Import pandas
import pandas as pd

# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\sp500.csv',parse_dates=True,index_col='Date')

# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\exchange.csv',parse_dates=True,index_col='Date')

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open','Close']]

# Print the head of dollars
print(dollars.head())

# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'],axis='rows')

# Print the head of pounds
print(pounds.head())

                   Open        Close
Date                                
2015-01-02  2058.899902  2058.199951
2015-01-05  2054.439941  2020.579956
2015-01-06  2022.150024  2002.609985
2015-01-07  2005.550049  2025.900024
2015-01-08  2030.609985  2062.139893
                   Open        Close
Date                                
2015-01-02  1340.364425  1339.908750
2015-01-05  1348.616555  1326.389506
2015-01-06  1332.515980  1319.639876
2015-01-07  1330.562125  1344.063112
2015-01-08  1343.268811  1364.126161
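
The axis='rows' argument matters because the plain * operator aligns a Series with a DataFrame's column labels rather than its row labels. A short sketch with invented prices and exchange rates shows the difference:

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-02', '2015-01-05'])
dollars = pd.DataFrame({'Open': [2000.0, 2100.0], 'Close': [2050.0, 2080.0]},
                       index=dates)

# Made-up exchange rates, one per day
rate = pd.Series([0.65, 0.66], index=dates, name='GBP/USD')

# `*` matches the Series index against the column labels; dates and
# column names share no labels, so every value comes out as NaN
wrong = dollars * rate

# .multiply(..., axis='rows') matches on the row index instead
pounds = dollars.multiply(rate, axis='rows')
print(pounds)
```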


## Appending pandas Series

In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the 'Units' column from each and append them together with method chaining using .append().

To check that the stacking worked, you'll print slices from these Series, and finally, you'll add the result to figure out the total units sold in the first quarter.

In [23]:
# Import pandas
import pandas as pd

# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\sales-jan-2015.csv',parse_dates=True,index_col='Date')

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\sales-feb-2015.csv',parse_dates=True,index_col='Date')

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv('K:\\TensorflowPY36CPU\\TensorflowPY36CPU\\_1_PythonBasic\\Pandas\\sales-mar-2015.csv',parse_dates=True,index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
print(quarter1.sum())


Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
642


## Concatenating pandas Series along row axis

Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. This time, the DataFrames jan, feb, and mar have been pre-loaded.

Your job is to use pd.concat() with a list of Series to achieve the same result that you would get by chaining calls to .append().

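You may be wondering about the difference between pd.concat() and pandas' .append() method. One way to think of the difference is that .append() is a specific case of a concatenation, while pd.concat() gives you more flexibility, as you'll see in later exercises. Note that .append() was deprecated in pandas 1.4 and removed in pandas 2.0, so code written for current releases should use pd.concat().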


In [24]:
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter1
quarter1 = pd.concat(units,axis='rows')

# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64


## Appending DataFrames with ignore_index
In this exercise, you'll use the Baby Names Dataset (from data.gov) again. This time, both DataFrames names_1981 and names_1881 are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame .append() method to make a DataFrame combined_names. To distinguish rows from the original two DataFrames, you'll add a 'year' column to each with the year (1881 or 1981 in this case). In addition, you'll specify ignore_index=True so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled 0, 1, ..., n-1, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.

In [25]:
## init
names_1881 = pd.read_csv(path+'names1881.csv',header=None,names=['name','gender','count'])
names_1981 = pd.read_csv(path+'names1981.csv',header=None,names=['name','gender','count'])

In [26]:
names_1881.head()
names_1981.head()


       name gender  count
0  Jennifer      F  57032
1   Jessica      F  42519
2    Amanda      F  34370
3     Sarah      F  28162
4   Melissa      F  28003


In [27]:
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981,ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

# Print all rows that contain the name 'Morgan'
print(combined_names.loc[combined_names['name']=='Morgan'])


(19455, 4)
(1935, 4)
(21390, 4)
         name gender  count  year
1283   Morgan      M     23  1881
2096   Morgan      F   1769  1981
14390  Morgan      M    766  1981


## Concatenating pandas DataFrames along column axis
The function pd.concat() can concatenate DataFrames horizontally as well as vertically (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument axis=1 or axis='columns'.

In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the rows of both and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an outer join (which you will explore in more detail in later exercises).

The files 'quarterly_max_temp.csv' and 'monthly_mean_temp.csv' have been pre-loaded into the DataFrames weather_max and weather_mean respectively, and pandas has been imported as pd.

In [28]:
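#init
# No index_col is set, so both DataFrames get a default RangeIndex;
# pd.concat(axis=1) below therefore pairs rows by position, not by month
weather_max = pd.read_csv(path+'quarterly_max_temp.csv')
weather_mean = pd.read_csv(path+'monthly_mean_temp.csv')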
weather_max.head()
weather_mean.head()


  Month  Max TemperatureF
0   Jan                68
1   Apr                89
2   Jul                91
3   Oct                84

In [29]:
# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max, weather_mean] ,axis=1)

# Print weather
print(weather)


    Month  Mean TemperatureF  Month  Max TemperatureF
0   Apr            53.100000  Jan                68.0
1   Aug            70.000000  Apr                89.0
2   Dec            34.935484  Jul                91.0
3   Feb            28.714286  Oct                84.0
4   Jan            32.354839    NaN               NaN
5   Jul            72.870968    NaN               NaN
6   Jun            70.133333    NaN               NaN
7   Mar            35.000000    NaN               NaN
8   May            62.612903    NaN               NaN
9   Nov            39.800000    NaN               NaN
10  Oct            55.451613    NaN               NaN
11  Sep            63.766667    NaN               NaN
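
In the output above the rows are paired by position, so Apr sits next to Jan. To get the outer join on month that the text describes, Month has to be the Index of both DataFrames. A minimal sketch, using a few values from the printed output and assuming set_index('Month') as the way to get there:

```python
import pandas as pd

# A few rows copied from the printed output above
weather_max = pd.DataFrame({'Month': ['Jan', 'Apr'],
                            'Max TemperatureF': [68, 89]})
weather_mean = pd.DataFrame({'Month': ['Apr', 'Feb', 'Jan'],
                             'Mean TemperatureF': [53.1, 28.714286, 32.354839]})

# Moving Month into the index lets pd.concat match rows by month label
aligned = pd.concat([weather_max.set_index('Month'),
                     weather_mean.set_index('Month')], axis=1)
print(aligned)
```

Months that appear only in the finer-grained DataFrame (Feb here) get NaN in the 'Max TemperatureF' column.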


## Reading multiple files to build a DataFrame

It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

Here, you'll work with DataFrames compiled from The Guardian's Olympic medal dataset.

pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types, which contains the strings 'bronze', 'silver', and 'gold'.

In [30]:
# init

medals = []
medal_types = ['bronze', 'silver', 'gold']




In [31]:
for medal in medal_types:

    # Create the file name: file_name
    file_name = path+"%s_top5.csv" % medal
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name,header=0,index_col='Country',names=columns)

    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals
medals = pd.concat(medals,axis='columns')

# Print medals
print(medals)

                bronze  silver    gold
France           475.0   461.0     NaN
Germany          454.0     NaN   407.0
Italy              NaN   394.0   460.0
Soviet Union     584.0   627.0   838.0
United Kingdom   505.0   591.0   498.0
United States   1052.0  1195.0  2088.0


## Concatenating vertically to get MultiIndexed rows
When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the keys parameter in the call to pd.concat(), which generates a hierarchical index with the labels from keys as the outermost index label. So you don't have to rename the columns of each DataFrame as you load it. Instead, only the Index column needs to be specified.

Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset. Once again, pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types, which contains the strings 'bronze', 'silver', and 'gold'.

In [32]:
# init

medals = []
medal_types = ['bronze', 'silver', 'gold']

In [33]:
for medal in medal_types:

    file_name = path+"%s_top5.csv" % medal
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name,index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)
    
    
# Concatenate medals: medals
medals = pd.concat(medals,keys=['bronze', 'silver', 'gold'] )

# Print medals in entirety
print(medals)

                        Total
       Country               
bronze United States   1052.0
       Soviet Union     584.0
       United Kingdom   505.0
       France           475.0
       Germany          454.0
silver United States   1195.0
       Soviet Union     627.0
       United Kingdom   591.0
       France           461.0
       Italy            394.0
gold   United States   2088.0
       Soviet Union     838.0
       United Kingdom   498.0
       Italy            460.0
       Germany          407.0
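
pd.concat() also accepts a dictionary, in which case the dictionary keys play the role of the keys argument. The sketch below uses a few totals from the output above; the name `medals_by_type` is made up for the example.

```python
import pandas as pd

countries = pd.Index(['United States', 'Soviet Union'], name='Country')

# Two small frames using totals from the output above
bronze = pd.DataFrame({'Total': [1052.0, 584.0]}, index=countries)
silver = pd.DataFrame({'Total': [1195.0, 627.0]}, index=countries)

# Dictionary keys become the outer index level, just like keys=[...]
medals_by_type = pd.concat({'bronze': bronze, 'silver': silver})
print(medals_by_type)
```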




## Slicing MultiIndexed DataFrames

This exercise picks up where the last ended (again using The Guardian's Olympic medal dataset).

You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use pd.IndexSlice to extract specific slices. For a refresher on MultiIndexed DataFrames, see the Manipulating DataFrames with pandas course.

pandas has been imported for you as pd and the DataFrame medals is already in your namespace.


In [34]:
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)


In [35]:
medals_sorted


                        Total
       Country               
bronze France           475.0
       Germany          454.0
       Soviet Union     584.0
       United Kingdom   505.0
       United States   1052.0
gold   Germany          407.0
       Italy            460.0
       Soviet Union     838.0
       United Kingdom   498.0
       United States   2088.0


In [36]:


# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])

# Print data about silver medals
print(medals_sorted.loc['silver'])

# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:, 'United Kingdom'], :])


Total    454.0
Name: (bronze, Germany), dtype: float64
                 Total
Country               
France           461.0
Italy            394.0
Soviet Union     627.0
United Kingdom   591.0
United States   1195.0
                       Total
       Country              
bronze United Kingdom  505.0
gold   United Kingdom  498.0
silver United Kingdom  591.0
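
The three kinds of lookup used above can be tried on a smaller, self-contained frame. This sketch rebuilds six rows from the totals printed earlier; because the index is built with from_product on already-sorted labels, no sort_index() call is needed here.

```python
import pandas as pd

# Six rows taken from the medal totals printed earlier
index = pd.MultiIndex.from_product(
    [['bronze', 'gold', 'silver'], ['Soviet Union', 'United Kingdom']],
    names=[None, 'Country'])
medals = pd.DataFrame({'Total': [584.0, 505.0, 838.0, 498.0, 627.0, 591.0]},
                      index=index)

idx = pd.IndexSlice

# One row: a full (outer, inner) tuple
one = medals.loc[('gold', 'Soviet Union'), 'Total']

# Every outer label, one inner label
uk = medals.loc[idx[:, 'United Kingdom'], :]

# A range of outer labels, every inner label
bronze_to_gold = medals.loc[idx['bronze':'gold', :], :]

print(one, uk, bronze_to_gold, sep='\n')
```

Slicing a range of outer labels requires a sorted index, which is why the exercise sorts medals first.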


## Concatenating horizontally to get MultiIndexed columns
It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise, you'll start with pandas imported and a list of three DataFrames called dataframes. All three DataFrames contain 'Company', 'Product', and 'Units' columns with a 'Date' column as the index pertaining to sales transactions during the month of February, 2015. The first DataFrame describes Hardware transactions, the second describes Software transactions, and the third, Service transactions.

Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns. From there, you can summarize the resulting DataFrame and slice some information from it.

In [37]:
#init
Hardware = pd.read_csv(path+'feb-sales-Hardware.csv',index_col='Date')
Software = pd.read_csv(path+'feb-sales-Software.csv',index_col='Date')
Service = pd.read_csv(path+'feb-sales-Service.csv',index_col='Date')
dataframes = [Hardware,Software,Service]

dataframes


[                             Company   Product  Units
 Date                                                 
 2015-02-04 21:52:45  Acme Coporation  Hardware     14
 2015-02-07 22:58:10  Acme Coporation  Hardware      1
 2015-02-19 10:59:33        Mediacore  Hardware     16
 2015-02-02 20:54:49        Mediacore  Hardware      9
 2015-02-21 20:41:47            Hooli  Hardware      3,
                              Company   Product  Units
 Date                                                 
 2015-02-16 12:09:19            Hooli  Software     10
 2015-02-03 14:14:18          Initech  Software     13
 2015-02-02 08:33:01            Hooli  Software      3
 2015-02-05 01:53:06  Acme Coporation  Software     19
 2015-02-11 20:03:08          Initech  Software      7
 2015-02-09 13:09:55        Mediacore  Software      7
 2015-02-11 22:50:44            Hooli  Software      4
 2015-02-04 15:36:29        Streeplex  Software     13
 2015-02-21 05:01:26        Mediacore  Software      3,
        

In [38]:
# Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'])


In [39]:
february.info()
print(february.head())


<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 2015-02-02 08:33:01 to 2015-02-26 08:58:51
Data columns (total 9 columns):
(Hardware, Company)    5 non-null object
(Hardware, Product)    5 non-null object
(Hardware, Units)      5 non-null float64
(Software, Company)    9 non-null object
(Software, Product)    9 non-null object
(Software, Units)      9 non-null float64
(Service, Company)     6 non-null object
(Service, Product)     6 non-null object
(Service, Units)       6 non-null float64
dtypes: float64(3), object(6)
memory usage: 1.6+ KB
                            Hardware                   Software            \
                             Company   Product Units    Company   Product   
2015-02-02 08:33:01              NaN       NaN   NaN      Hooli  Software   
2015-02-02 20:54:49        Mediacore  Hardware   9.0        NaN       NaN   
2015-02-03 14:14:18              NaN       NaN   NaN    Initech  Software   
2015-02-04 15:36:29              NaN       NaN   NaN  Stree

In [40]:

# Print february.info()
print(february.info())

# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

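# Create the slice: slice_2_8
# Use .loc[] and idx to extract the rows from Feb. 2, 2015 through Feb. 8, 2015 in the columns under 'Company'.
# Note: date-string slicing like this needs a DatetimeIndex. The index here holds plain strings
# (the files were read without parse_dates), so the slice printed below comes back empty.
slice_2_8 = february.loc['Feb 2,2015':'Feb 8,2015', idx[:, 'Company']]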

# Print slice_2_8
print(slice_2_8)


<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 2015-02-02 08:33:01 to 2015-02-26 08:58:51
Data columns (total 9 columns):
(Hardware, Company)    5 non-null object
(Hardware, Product)    5 non-null object
(Hardware, Units)      5 non-null float64
(Software, Company)    9 non-null object
(Software, Product)    9 non-null object
(Software, Units)      9 non-null float64
(Service, Company)     6 non-null object
(Service, Product)     6 non-null object
(Service, Units)       6 non-null float64
dtypes: float64(3), object(6)
memory usage: 1.6+ KB
None
Empty DataFrame
Columns: [(Hardware, Company), (Software, Company), (Service, Company)]
Index: []
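
The empty result follows from the index type: `february.info()` reports a plain `Index` of strings rather than a `DatetimeIndex`, so the slice bounds are compared as text. A minimal sketch of the difference, using a small invented frame (`frame`, its values, and the ISO-style bounds are assumptions for illustration, not the exercise data):

```python
import pandas as pd

# Hypothetical stand-in for february: timestamps stored as plain strings
frame = pd.DataFrame(
    {'Units': [3, 9, 13, 7]},
    index=['2015-02-02 08:33:01', '2015-02-02 20:54:49',
           '2015-02-07 22:58:10', '2015-02-11 20:03:08'],
)

# On a string index the bounds are compared as text, so nothing falls between them
empty = frame.loc['Feb 2,2015':'Feb 8,2015']
print(len(empty))

# After converting the index to datetimes, the bounds are interpreted as dates
frame.index = pd.to_datetime(frame.index)
selected = frame.loc['2015-02-02':'2015-02-08']
print(selected)
```

In the notebook itself, one way to get the same effect would be to pass `parse_dates=True` to `pd.read_csv()` when loading the sales files, or to convert `february.index` with `pd.to_datetime()` before slicing.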




## Concatenating DataFrames from a dict

You're now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames jan, feb, and mar have been pre-loaded for you. Your task is to aggregate the sum of all sales over the 'Company' column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then concatenating them.


In [42]:
# Make the list of tuples: month_list
#Create a list called month_list consisting of the tuples ('january', jan), ('february', feb), and ('march', mar).
month_list = [('january', jan),('february', feb),('march', mar)]

# Create an empty dictionary: month_dict
month_dict = dict()

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
print(sales)

# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])


                          Units
         Company               
february Acme Coporation     34
         Hooli               30
         Initech             30
         Mediacore           45
         Streeplex           37
january  Acme Coporation     76
         Hooli               70
         Initech             37
         Mediacore           15
         Streeplex           50
march    Acme Coporation      5
         Hooli               37
         Initech             68
         Mediacore           68
         Streeplex           40
                    Units
         Company         
february Mediacore     45
january  Mediacore     15
march    Mediacore     68
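
The same pattern can be tried without the pre-loaded data. The sketch below uses two small invented frames (`jan`, `feb`, and their values are assumptions): concatenating a dict of DataFrames uses the dict keys as the outer level of the row index, which `pd.IndexSlice` can then cut across.

```python
import pandas as pd

# Hypothetical monthly sales (values invented for illustration)
jan = pd.DataFrame({'Company': ['Hooli', 'Initech', 'Hooli'], 'Units': [5, 2, 1]})
feb = pd.DataFrame({'Company': ['Initech', 'Hooli'], 'Units': [7, 4]})

# One aggregated frame per month, keyed by month name
totals = {
    'january': jan.groupby('Company')[['Units']].sum(),
    'february': feb.groupby('Company')[['Units']].sum(),
}

# The dict keys become the outer level of the resulting MultiIndex;
# sort_index() puts the outer level in alphabetical order
sales = pd.concat(totals).sort_index()
print(sales)

# All rows for one company, across every month
idx = pd.IndexSlice
hooli = sales.loc[idx[:, 'Hooli'], :]
print(hooli)
```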


## Concatenating DataFrames with inner join
Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset.

The DataFrames bronze, silver, and gold have been pre-loaded for you.

Your task is to compute an inner join.
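
To see what `join='inner'` changes, here is a minimal sketch with a few medal totals taken from the tables in this notebook. Indexing the frames by country name is an assumption made for this example so that the indexes only partly overlap; the pre-loaded DataFrames use a default integer index.

```python
import pandas as pd

# A few totals from the tables above, re-indexed by country (an assumption for this sketch)
bronze = pd.DataFrame({'Total': [1052.0, 584.0, 505.0]},
                      index=['United States', 'Soviet Union', 'United Kingdom'])
gold = pd.DataFrame({'Total': [2088.0, 498.0, 378.0]},
                    index=['United States', 'United Kingdom', 'France'])

# Inner join: keep only index labels present in every input
inner = pd.concat([bronze, gold], keys=['bronze', 'gold'], axis=1, join='inner')
print(inner)

# Outer join (the default): keep every label and fill the gaps with NaN
outer = pd.concat([bronze, gold], keys=['bronze', 'gold'], axis=1, join='outer')
print(outer)
```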

In [43]:
# Create the list of DataFrames: medal_list
#Construct a list of DataFrames called medal_list with entries bronze, silver, and gold.
medal_list = [bronze,silver,gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list,keys=['bronze', 'silver', 'gold'],axis=1,join='inner')

# Print medals
print(medals)


    bronze                                silver                         \
       NOC                Country   Total    NOC                Country   
0      USA          United States  1052.0    USA          United States   
1      URS           Soviet Union   584.0    URS           Soviet Union   
2      GBR         United Kingdom   505.0    GBR         United Kingdom   
3      FRA                 France   475.0    FRA                 France   
4      GER                Germany   454.0    GER                Germany   
5      AUS              Australia   413.0    AUS              Australia   
6      ITA                  Italy   374.0    ITA                  Italy   
7      HUN                Hungary   345.0    HUN                Hungary   
8      SWE                 Sweden   325.0    SWE                 Sweden   
9      NED            Netherlands   320.0    NED            Netherlands   
10     ROU                Romania   282.0    ROU                Romania   
11     JPN               

## Resampling & concatenating DataFrames with inner join


In [44]:
#init
china = pd.read_csv(path+'china.csv',index_col='Year',parse_dates=True)
us = pd.read_csv(path+'usa.csv',index_col='Year',parse_dates=True)

china.head()
us.head()


               US
Year
1947-04-01  246.3
1947-07-01  250.1
1947-10-01  260.3
1948-01-01  266.2
1948-04-01  272.9


In [45]:
# Resample and tidy china: china_annual
# Aggregate explicitly with .mean(): older pandas applied this mean implicitly
# (with a deprecation warning) when pct_change() was called on the deferred resampler.
china_annual = china.resample('A').mean().pct_change(10).dropna()
  


In [46]:


# Resample and tidy us: us_annual
# An explicit .mean() replaces the implicit mean that older pandas applied
# (with a deprecation warning) when pct_change() was called on the deferred resampler.
us_annual = us.resample('A').mean().pct_change(10).dropna()

# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], axis=1, join='inner')

# Resample gdp and print
print(gdp.resample('10A').last())


Empty DataFrame
Columns: [China, US]
Index: []
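
The resample-then-concatenate pattern can be sketched on invented data as follows. Everything here is an assumption for illustration: the values, the short date ranges, the one-period `pct_change(1)`, and the use of `.last()` as the aggregation. The sketch also uses the year-start alias `'YS'` rather than `'A'`, because the spelling of the year-end alias has changed between pandas versions.

```python
import pandas as pd

# Hypothetical quarterly series (values invented for illustration)
idx_q = pd.date_range('2000-01-01', periods=12, freq='QS')
us = pd.DataFrame({'US': [100.0 + i for i in range(12)]}, index=idx_q)

# Hypothetical yearly series covering a different span
idx_y = pd.date_range('2001-01-01', periods=3, freq='YS')
china = pd.DataFrame({'China': [50.0, 55.0, 66.0]}, index=idx_y)

# Resample to yearly frequency with an explicit aggregation, then compute
# the year-over-year change and drop the leading NaN row
us_annual = us.resample('YS').last().pct_change(1).dropna()
china_annual = china.resample('YS').last().pct_change(1).dropna()

# The inner join keeps only the years present in both frames
gdp = pd.concat([china_annual, us_annual], axis=1, join='inner')
print(gdp)
```

Because an inner join keeps only shared index labels, two annual frames whose dates do not line up would produce an empty result, so comparing `china_annual.index` with `us_annual.index` is a reasonable first check when `gdp` comes back empty.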


## Merging on a specific column


This exercise follows on from the last one, with the DataFrames revenue and managers for your company. You expect your company to grow and, eventually, to operate in cities with the same name in different states. As such, you decide that every branch should have a numerical branch identifier. Thus, you add a branch_id column to both DataFrames. New cities have also been added to both the revenue and managers DataFrames. pandas has been imported as pd and both DataFrames are available in your namespace.

At present, there should be a 1-to-1 relationship between the city and branch_id fields. In that case, the result of a merge on the city columns ought to give you the same output as a merge on the branch_id columns. Do they? Can you spot an ambiguity in one of the DataFrames?

In [58]:
#init 

managers = pd.read_csv(path+'managers.csv')
revenue = pd.read_csv(path+'revenue.csv')


In [59]:
managers


   branch_id         city   manager
0         10       Austin  Charlers
1         20       Denver      Joel
2         47    Mendocino     Brett
3         31  Springfield     Sally


In [60]:
revenue


   branch_id         city  revenue
0         10       Austin      100
1         20       Denver       83
2         30  Springfield        4
3         47    Mendocino      200


In [61]:
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue,managers,on='city')

# Print merge_by_city
print(merge_by_city)

# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue,managers,on='branch_id')

# Print merge_by_id
print(merge_by_id)


   branch_id_x         city  revenue  branch_id_y   manager
0           10  Austin           100           10  Charlers
1           20  Denver            83           20      Joel
2           30  Springfield        4           31     Sally
3           47  Mendocino        200           47     Brett
   branch_id       city_x  revenue       city_y   manager
0         10  Austin           100  Austin       Charlers
1         20  Denver            83  Denver           Joel
2         47  Mendocino        200  Mendocino       Brett
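
The two merges disagree: merging on 'city' returns four rows, while merging on 'branch_id' returns three. One way to locate the ambiguity is to filter the city-based merge for rows whose two branch_id values differ. The sketch below rebuilds the two DataFrames from the values printed above; the literal constructors and the names `by_city`, `by_id`, and `mismatch` are assumptions of this example.

```python
import pandas as pd

# Rebuilt from the tables printed above
revenue = pd.DataFrame({'branch_id': [10, 20, 30, 47],
                        'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'revenue': [100, 83, 4, 200]})
managers = pd.DataFrame({'branch_id': [10, 20, 47, 31],
                         'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
                         'manager': ['Charlers', 'Joel', 'Brett', 'Sally']})

by_city = pd.merge(revenue, managers, on='city')
by_id = pd.merge(revenue, managers, on='branch_id')

# Rows where the same city carries two different branch identifiers
mismatch = by_city[by_city['branch_id_x'] != by_city['branch_id_y']]
print(mismatch)
```

Here the filter should single out Springfield, which is listed with branch_id 30 in revenue and 31 in managers.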


## Merging on columns with non-matching labels
You continue working with the revenue & managers DataFrames from before. This time, someone has changed the field name 'city' to 'branch' in the managers table. Now, when you attempt to merge DataFrames, an exception is thrown:

>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
    ... <text deleted> ...
    pd.merge(revenue, managers, on='city')
    ... <text deleted> ...
KeyError: 'city'
Given this, it will take a bit more work for you to join or merge on the city/branch name. You have to specify the left_on and right_on parameters in the call to pd.merge().

As before, pandas has been pre-imported as pd and the revenue and managers DataFrames are in your namespace. They have been printed in the IPython Shell so you can examine the columns prior to merging.

Are you able to merge better than in the last exercise? How should the rows with Springfield be handled?

In [63]:
#init

managers = pd.read_csv(path+'managers2.csv')
revenue = pd.read_csv(path+'revenue2.csv')

In [64]:
# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue,managers,left_on='city',right_on='branch')

# Print combined
print(combined)


   branch_id         city  revenue     state_x       branch  branch_id    \
0         10  Austin               100      TX  Austin                10   
1         20  Denver                83      CO  Denver                20   
2         30  Springfield            4      IL  Springfield           31   
3         47  Mendocino            200      CA  Mendocino             47   

    manager state_y  
0  Charlers      TX  
1  Joel          CO  
2  Sally         MO  
3  Brett         CA  


## Merging on multiple columns

Another strategy to disambiguate cities with identical names is to add information on the states in which the cities are located. To this end, you add a column called state to both DataFrames from the preceding exercises. Again, pandas has been pre-imported as pd and the revenue and managers DataFrames are in your namespace.

Your goal in this exercise is to use pd.merge() to merge DataFrames using multiple columns (using 'branch_id', 'city', and 'state' in this case).

Are you able to match all your company's branches correctly?


In [67]:
managers = pd.read_csv(path+'managers.csv')
revenue = pd.read_csv(path+'revenue.csv')

In [69]:
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']

# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']

# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue,managers,on=['branch_id', 'city', 'state'])

# Print combined
print(combined)

   branch_id         city  revenue state   manager
0         10  Austin           100    TX  Charlers
1         20  Denver            83    CO      Joel
2         47  Mendocino        200    CA     Brett
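
Only three of the four branches survive this merge. To see which rows fell out, one option is an outer merge with `indicator=True`, which adds a `_merge` column recording where each row came from. This is a sketch rather than part of the exercise; the DataFrames are rebuilt from the values shown earlier and the name `unmatched` is made up.

```python
import pandas as pd

# Rebuilt from the tables and 'state' assignments shown above
revenue = pd.DataFrame({'branch_id': [10, 20, 30, 47],
                        'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'revenue': [100, 83, 4, 200],
                        'state': ['TX', 'CO', 'IL', 'CA']})
managers = pd.DataFrame({'branch_id': [10, 20, 47, 31],
                         'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
                         'manager': ['Charlers', 'Joel', 'Brett', 'Sally'],
                         'state': ['TX', 'CO', 'CA', 'MO']})

# The outer join keeps unmatched rows; _merge is 'both', 'left_only', or 'right_only'
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'],
                    how='outer', indicator=True)

unmatched = combined[combined['_merge'] != 'both']
print(unmatched)
```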


## Which join should I use?
 - df1.append(df2): stack vertically (removed in pandas 2.0; use pd.concat([df1, df2]) instead)
 - pd.concat([df1, df2])
    stack vertically or horizontally
    simple inner/outer join on indexes
 - df1.join(df2): inner/outer/left/right join on indexes
 - pd.merge(df1, df2): many joins on multiple columns
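
The options in this list can be compared side by side on two tiny made-up frames (`df1`, `df2`, and their contents are invented for this sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# pd.concat, vertical: rows of df2 go under df1, missing columns become NaN
stacked = pd.concat([df1, df2], ignore_index=True)

# pd.concat, horizontal: frames are aligned on the index labels (0 and 1 here)
side_by_side = pd.concat([df1, df2], axis=1, join='inner')

# .join: aligns on the index, so the key is moved into the index first
joined = df1.set_index('key').join(df2.set_index('key'), how='outer')

# pd.merge: matches rows on the values of one or more columns
merged = pd.merge(df1, df2, on='key', how='inner')

for name, frame in [('stacked', stacked), ('side_by_side', side_by_side),
                    ('joined', joined), ('merged', merged)]:
    print(name, frame.shape)
```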




## Left & right merging on multiple columns
You now have, in addition to the revenue and managers DataFrames from prior exercises, a DataFrame sales that summarizes units sold from specific branches (identified by city and state but not branch_id).

Once again, the managers DataFrame uses the label branch in place of city as in the other two DataFrames. Your task here is to employ left and right merges to preserve data and identify where data is missing.

By merging revenue and sales with a right merge, you can identify the missing revenue values. Here, you don't need to specify left_on or right_on because the columns to merge on have matching labels.

By merging sales and managers with a left merge, you can identify the missing manager. Here, the columns to merge on have conflicting labels, so you must specify left_on and right_on. In both cases, you're looking to figure out how to connect the fields in rows containing Springfield.

pandas has been imported as pd and the three DataFrames revenue, managers, and sales have been pre-loaded. They have been printed for you to explore in the IPython Shell.

In [5]:
#init

import pandas as pd

managers = pd.read_csv(path+'managers2.csv')


In [6]:
managers.head()


        branch  branch_id   manager state
0       Austin         10  Charlers    TX
1       Denver         20      Joel    CO
2    Mendocino         47     Brett    CA
3  Springfield         31     Sally    MO


In [7]:

revenue = pd.read_csv(path+'revenue2.csv')


In [8]:
revenue.head()


   branch_id         city  revenue state
0         10       Austin      100    TX
1         20       Denver       83    CO
2         30  Springfield        4    IL
3         47    Mendocino      200    CA


In [9]:

sales = pd.read_csv(path+'sales2.csv')
sales.head()


          city state  units
0    Mendocino    CA      1
1       Denver    CO      4
2       Austin    TX      2
3  Springfield    MO      5
4  Springfield    IL      1


In [10]:
# Merge revenue and sales: revenue_and_sales
#Execute a right merge using pd.merge() with revenue and sales to yield a new DataFrame revenue_and_sales.
#Use how='right' and on=['city', 'state'].
revenue_and_sales = pd.merge(revenue,sales,how='right',on=['city', 'state'])

# Print revenue_and_sales
print(revenue_and_sales)


KeyError: 'state'

In [11]:
# Merge sales and managers: sales_and_managers
sales_and_managers =pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
print(sales_and_managers)


KeyError: 'state'
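
Both merge cells above stopped with a KeyError in this run, so here is a sketch of what the two merges are meant to show. It rebuilds the three DataFrames from the values printed earlier, which assumes clean column names and values rather than the files as loaded.

```python
import pandas as pd

# Rebuilt from the tables printed above (assumed clean labels and values)
revenue = pd.DataFrame({'branch_id': [10, 20, 30, 47],
                        'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
                        'revenue': [100, 83, 4, 200],
                        'state': ['TX', 'CO', 'IL', 'CA']})
managers = pd.DataFrame({'branch': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
                         'branch_id': [10, 20, 47, 31],
                         'manager': ['Charlers', 'Joel', 'Brett', 'Sally'],
                         'state': ['TX', 'CO', 'CA', 'MO']})
sales = pd.DataFrame({'city': ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield'],
                      'state': ['CA', 'CO', 'TX', 'MO', 'IL'],
                      'units': [1, 4, 2, 5, 1]})

# Right merge: every row of sales is kept, so missing revenue shows up as NaN
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])
missing_revenue = revenue_and_sales[revenue_and_sales['revenue'].isna()]
print(missing_revenue)

# Left merge: every row of sales is kept, so a missing manager shows up as NaN
sales_and_managers = pd.merge(sales, managers, how='left',
                              left_on=['city', 'state'], right_on=['branch', 'state'])
missing_manager = sales_and_managers[sales_and_managers['manager'].isna()]
print(missing_manager)
```

With these inputs, the row lacking revenue should be the Springfield branch in MO, and the row lacking a manager should be the Springfield branch in IL.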

## Merging DataFrames with outer join
This exercise picks up where the previous one left off. The DataFrames revenue, managers, and sales are pre-loaded into your namespace (and, of course, pandas is imported as pd). Moreover, the merged DataFrames revenue_and_sales and sales_and_managers have been pre-computed exactly as you did in the previous exercise.

The merged DataFrames contain enough information to construct a DataFrame with 5 rows with all known information correctly aligned and each branch listed only once. You will try to merge the merged DataFrames on all matching keys (which computes an inner join by default). You can compare the result to an outer join and also to an outer join with a restricted subset of columns as keys.

In [15]:
#init 

sales_and_managers = pd.read_csv(path+'sales_and_managers.csv')
sales_and_managers.head()


Unnamed: 0,city,state,units,branch,branch_id,manager
0,Mendocino,CA,1 Mendocino,47.0,Brett,
1,Denver,CO,4 Denver,20.0,Joel,
2,Austin,TX,2 Austin,10.0,Charlers,
3,Springfield,MO,5 Springfield,31.0,Sally,
4,Springfield,IL,1 NaN,,,


In [18]:
revenue_and_sales = pd.read_csv(path+'revenue_and_sales.csv')
revenue_and_sales.head()



In [19]:
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers,revenue_and_sales)

# Print merge_default
print(merge_default)

# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers,revenue_and_sales,how='outer')

# Print merge_outer
print(merge_outer)

# Perform the third merge: merge_outer_on
merge_outer_on =pd.merge(sales_and_managers,revenue_and_sales,on=['city','state'],how='outer')

# Print merge_outer_on
print(merge_outer_on)


Empty DataFrame
Columns: [city       , state, units        , branch, branch_id, manager, revenue, units]
Index: []
   city         state  units          branch  branch_id  manager  revenue  \
0  Mendocino    CA     1   Mendocino  47.0        Brett      NaN      NaN   
1  Denver       CO     4      Denver  20.0         Joel      NaN      NaN   
2  Austin       TX     2      Austin  10.0     Charlers      NaN      NaN   
3  Springfield  MO     5 Springfield  31.0        Sally      NaN      NaN   
4  Springfield  IL     1         NaN  NaN           NaN      NaN      NaN   
5  Austin       TX               NaN     NaN  10.0           NaN  100.0     
6  Denver       CO               NaN     NaN  20.0           NaN  83.0      
7  Springfield  IL               NaN     NaN  30.0           NaN  4.0       
8  Mendocino    CA               NaN     NaN  47.0           NaN  200.0     
9  Springfield  MO               NaN     NaN  NaN            NaN  NaN       

   units  
0    NaN  
1    NaN  
2   

KeyError: 'city'
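
For comparison, the three merges can be sketched on hand-built inputs. The two frames below are assumptions: they are reconstructed from the tables printed earlier in this notebook, with clean column names and values, so the results need not match the output above.

```python
import pandas as pd

nan = float('nan')
cities = ['Mendocino', 'Denver', 'Austin', 'Springfield', 'Springfield']
states = ['CA', 'CO', 'TX', 'MO', 'IL']

# Assumed clean versions of the two merged frames
sales_and_managers = pd.DataFrame({
    'city': cities,
    'state': states,
    'units': [1, 4, 2, 5, 1],
    'branch': ['Mendocino', 'Denver', 'Austin', 'Springfield', None],
    'branch_id': [47.0, 20.0, 10.0, 31.0, nan],
    'manager': ['Brett', 'Joel', 'Charlers', 'Sally', None],
})
revenue_and_sales = pd.DataFrame({
    'branch_id': [47.0, 20.0, 10.0, nan, 30.0],
    'city': cities,
    'revenue': [200.0, 83.0, 100.0, nan, 4.0],
    'state': states,
    'units': [1, 4, 2, 5, 1],
})

# Default: inner join on every column name the two frames share
merge_default = pd.merge(sales_and_managers, revenue_and_sales)

# Outer join on the same shared columns
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')

# Outer join on city and state only; other shared columns get _x/_y suffixes
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales,
                          on=['city', 'state'], how='outer')

print(len(merge_default), len(merge_outer), len(merge_outer_on))
```

Restricting the keys to 'city' and 'state' is what lets the two Springfield rows line up, since their branch_id values are incomplete in one frame or the other.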