# Merging DataFrames with Pandas

In [1]:
import pandas as pd
from glob import glob

## Reading multiple data files

### Tools for pandas data import

* pd.read_csv() for CSV files
    * dataframe = pd.read_csv(filepath)
    * dozens of optional input parameters
* Other data import tools:
    * pd.read_excel()
    * pd.read_html()
    * pd.read_json()

In [2]:
# loading separate files
dataframe0 = pd.read_csv('data/Sales/sales-jan-2015.csv')
dataframe1 = pd.read_csv('data/Sales/sales-feb-2015.csv')

In [3]:
# Using for loops
filenames = ['data/Sales/sales-jan-2015.csv', 'data/Sales/sales-feb-2015.csv']
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))

In [4]:
# Using a list comprehension
dataframes = [pd.read_csv(f) for f in filenames]

In [5]:
# Using glob
filenames = glob('data/Sales/sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]
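
One practical detail when using glob: the order of the returned file names is not guaranteed, so sorting them makes the resulting collection of DataFrames reproducible. The sketch below is a self-contained illustration under assumptions: it writes two tiny, made-up CSV files (the `Date` and `Units` columns and all values are hypothetical) to a temporary directory and reads them back with a comprehension.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

# Create two small, made-up CSV files in a temporary directory
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / 'sales-jan-2015.csv').write_text('Date,Units\n2015-01-05,3\n2015-01-20,7\n')
(tmp_dir / 'sales-feb-2015.csv').write_text('Date,Units\n2015-02-02,5\n')

# glob() returns matches in arbitrary order, so sort for reproducibility
filenames = sorted(glob(str(tmp_dir / 'sales*.csv')))

# Key each DataFrame by its file name so the source stays identifiable
frames = {Path(f).name: pd.read_csv(f) for f in filenames}

for name, frame in frames.items():
    print(name, frame.shape)
```

Keeping the DataFrames in a dict rather than a list is a design choice for this sketch; a plain list works just as well when the file names are not needed later.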

### Reading DataFrames from multiple files

When data is spread among several files, you usually invoke pandas' read_csv() (or a similar data import function) multiple times to load the data into several DataFrames.

The data files for this example have been derived from a list of Olympic medals awarded between 1896 & 2008 compiled by the [Guardian](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

The column labels of each DataFrame are NOC, Country, & Total where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won (bronze, silver, or gold).

In [6]:
# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv('data/summer_Olympic_medals/Bronze.csv')

# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv('data/summer_Olympic_medals/Silver.csv')

# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv('data/summer_Olympic_medals/Gold.csv')

# Print the first five rows of gold
print(gold.head())

   NOC         Country   Total
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0


### Reading DataFrames from multiple files in a loop

Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.

In [7]:
# Create the list of file names: filenames
filenames = ['data/summer_Olympic_medals/Gold.csv', 'data/summer_Olympic_medals/Silver.csv', 'data/summer_Olympic_medals/Bronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

# Print top 5 rows of 1st DataFrame in dataframes
print(dataframes[0].head())

   NOC         Country   Total
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0


### Combining DataFrames from multiple data files

In this exercise, you'll combine the three DataFrames from earlier exercises - gold, silver, & bronze - into a single DataFrame called medals. The approach you'll use here is clumsy. Later on in the course, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.

Remember, the column labels of each DataFrame are NOC, Country, and Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won.

In [8]:
# Make a copy of gold: medals
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels

# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']

# Print the head of medals
print(medals.head())

   NOC         Country    Gold  Silver  Bronze
0  USA   United States  2088.0  1195.0  1052.0
1  URS    Soviet Union   838.0   627.0   584.0
2  GBR  United Kingdom   498.0   591.0   505.0
3  FRA          France   378.0   461.0   475.0
4  GER         Germany   407.0   350.0   454.0
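
Assigning a column from another DataFrame aligns on index labels, not on row position. With the default integer index, the approach above therefore only pairs the right countries if all three files list them in the same order. The following sketch uses small made-up tables (the totals are hypothetical) to show the pitfall and one way around it, namely indexing by NOC before assigning.

```python
import pandas as pd

# Made-up medal tables; silver lists the countries in a different order
gold = pd.DataFrame({'NOC': ['USA', 'URS', 'GBR'], 'Total': [10, 8, 5]})
silver = pd.DataFrame({'NOC': ['GBR', 'USA', 'URS'], 'Total': [6, 9, 7]})

# Default labels (0, 1, 2) pair rows that describe different countries
naive = gold.rename(columns={'Total': 'Gold'})
naive['Silver'] = silver['Total']

# Indexing by NOC lets pandas align rows on the country code instead
medals = gold.set_index('NOC').rename(columns={'Total': 'Gold'})
medals['Silver'] = silver.set_index('NOC')['Total']

print(naive)   # USA is paired with GBR's silver total
print(medals)  # each country gets its own silver total
```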


## Reindexing DataFrames

**“Indexes” vs. “Indices”**
* indices: many index labels within Index data structures (entries)
* indexes: many pandas Index data structures (first column)

```python
# Importing weather data
import pandas as pd 
w_mean = pd.read_csv('quarterly_mean_temp.csv', index_col='Month')
w_max = pd.read_csv('quarterly_max_temp.csv', index_col='Month')

# Examining the data
In [4]: print(w_mean)
Month Mean TemperatureF
Apr 61.956044
Jan 32.133333
Jul 68.934783
Oct 43.434783
In [5]: print(w_max)
Month Max TemperatureF
Jan 68
Apr 89
Jul 91
Oct 84

# DataFrame indexes
In [6]: print(w_mean.index)
Index(['Apr', 'Jan', 'Jul', 'Oct'], dtype='object', name='Month')
In [7]: print(w_max.index)
Index(['Jan', 'Apr', 'Jul', 'Oct'], dtype='object', name='Month')
In [8]: print(type(w_mean.index))
<class 'pandas.indexes.base.Index'>

# Using .reindex()
In [9]: ordered = ['Jan', 'Apr', 'Jul', 'Oct']
In [10]: w_mean2 = w_mean.reindex(ordered)
In [11]: print(w_mean2)
Month Mean TemperatureF
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783

# Using .sort_index()
In [12]: w_mean2.sort_index()
Out[12]:
Month Mean TemperatureF
Apr 61.956044
Jan 32.133333
Jul 68.934783
Oct 43.434783

# Reindex from a DataFrame Index
In [13]: w_mean.reindex(w_max.index)
Out[13]:
Month Mean TemperatureF
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783

# Reindexing with missing labels
In [14]: w_mean3 = w_mean.reindex(['Jan', 'Apr', 'Dec'])
In [15]: print(w_mean3)
Month Mean TemperatureF
Jan 32.133333
Apr 61.956044
Dec NaN

# Reindex from a DataFrame Index
In [16]: w_max.reindex(w_mean3.index)
Out[16]:
 Max TemperatureF
Month
Jan 68.0
Apr 89.0
Dec NaN
In [17]: w_max.reindex(w_mean3.index).dropna()
Out[17]:
 Max TemperatureF
Month
Jan 68.0
Apr 89.0

# ORDER MATTERS
In [18]: w_max.reindex(w_mean.index)
Out[18]:
 Max TemperatureF
Month
Apr 89
Jan 68
Jul 91
Oct 84
In [19]: w_mean.reindex(w_max.index)
Out[19]:
 Mean TemperatureF
Month
Jan 32.133333
Apr 61.956044
Jul 68.934783
Oct 43.434783
```
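
The transcript above depends on CSV files that are not included here. As a self-contained sketch, the same `.reindex()` behaviour can be reproduced by building `w_mean` inline from the values printed above; constructing the DataFrame this way is an assumption made for illustration only.

```python
import pandas as pd

# Quarterly mean temperatures, using the values printed above
w_mean = pd.DataFrame(
    {'Mean TemperatureF': [61.956044, 32.133333, 68.934783, 43.434783]},
    index=pd.Index(['Apr', 'Jan', 'Jul', 'Oct'], name='Month'),
)

# Reorder the rows chronologically
ordered = w_mean.reindex(['Jan', 'Apr', 'Jul', 'Oct'])

# Labels absent from the original index ('Dec') produce rows of NaN
with_missing = w_mean.reindex(['Jan', 'Apr', 'Dec'])

print(ordered)
print(with_missing)
print(with_missing.dropna())
```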


### Sorting DataFrame with the Index & columns

It is often useful to rearrange the sequence of the rows of a DataFrame by sorting. You don't have to implement these yourself; the principal methods for doing this are .sort_index() and .sort_values().

In this exercise, you'll use these methods with a DataFrame of temperature values indexed by month names. You'll sort the rows alphabetically using the Index and numerically using a column. Notice, for this data, the original ordering is probably most useful and intuitive: the purpose here is for you to understand what the sorting methods do.

```python
# Import pandas
import pandas as pd

# Read 'monthly_max_temp.csv' into a DataFrame: weather1
weather1 = pd.read_csv('monthly_max_temp.csv', index_col='Month')

# Print the head of weather1
print(weather1.head())

# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
print(weather2.head())

# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)

# Print the head of weather3
print(weather3.head())

# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values('Max TemperatureF')

# Print the head of weather4
print(weather4.head())
```
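
For readers without 'monthly_max_temp.csv', here is a minimal sketch of the three sorts on a four-row DataFrame. It reuses the quarterly maximum temperatures shown earlier; the variable names are chosen for this example only.

```python
import pandas as pd

# Quarterly maximum temperatures, in chronological order
weather = pd.DataFrame(
    {'Max TemperatureF': [68, 89, 91, 84]},
    index=pd.Index(['Jan', 'Apr', 'Jul', 'Oct'], name='Month'),
)

# Alphabetical by index label
by_label = weather.sort_index()

# Reverse alphabetical by index label
by_label_desc = weather.sort_index(ascending=False)

# Ascending by the values in one column
by_value = weather.sort_values('Max TemperatureF')

print(list(by_label.index))
print(list(by_label_desc.index))
print(list(by_value.index))
```

Each method returns a new DataFrame and leaves `weather` unchanged.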

### Reindexing DataFrame from a list

Sorting methods are not the only way to change DataFrame Indexes. There is also the .reindex() method.

In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of upsampling or increasing the rate of samples, which you may recall from the pandas Foundations course).

The original data has the first month's abbreviation of the quarter (three-month interval) on the Index, namely Apr, Jan, Jul, and Oct. This data has been loaded into a DataFrame called weather1 and has been printed in its entirety in the IPython Shell. Notice it has only four rows (corresponding to the first month of each quarter) and that the rows are not sorted chronologically.

You'll initially use a list of all twelve month abbreviations and subsequently apply the .ffill() method to forward-fill the null entries when upsampling. This list of month abbreviations has been pre-loaded as year.

```python
# Import pandas
import pandas as pd

# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
print(weather2)

# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
print(weather3)
```
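
The exercise code relies on the pre-loaded names weather1 and year. A self-contained sketch, assuming weather1 holds the quarterly means shown earlier and year is the list of twelve month abbreviations, could look like this:

```python
import pandas as pd

# Assumed contents of the pre-loaded list `year`
year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Assumed contents of the pre-loaded DataFrame `weather1`
weather1 = pd.DataFrame(
    {'Mean TemperatureF': [61.956044, 32.133333, 68.934783, 43.434783]},
    index=pd.Index(['Apr', 'Jan', 'Jul', 'Oct'], name='Month'),
)

# Upsample to twelve rows; the eight new months are NaN
weather2 = weather1.reindex(year)

# Forward-fill so each month carries the value of its quarter's first month
weather3 = weather1.reindex(year).ffill()

print(weather2)
print(weather3)
```

Forward-filling only makes sense because `year` is in chronological order: each NaN is replaced by the most recent non-null value above it.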

### Reindexing using another DataFrame Index

Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame .reindex() method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its .index attribute.

The Baby Names Dataset from data.gov summarizes counts of names (with genders) from births registered in the US since 1881. In this exercise, you will start with two baby-names DataFrames names_1981 and names_1881 loaded for you.

The DataFrames names_1981 and names_1881 both have a MultiIndex with levels name and gender giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, names_1981 and names_1881 were read in using the following commands:

In [9]:
names_1981 = pd.read_csv('data/baby_names/names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('data/baby_names/names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))

In [10]:
names_1881.shape

(1935, 1)

In [11]:
names_1981.shape

(19455, 1)

As you can see from their shapes, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.

Your job here is to use the DataFrame .reindex() and .dropna() methods to make a DataFrame common_names counting names from 1881 that were still popular in 1981.

In [12]:
# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
print(common_names.shape)

# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
print(common_names.shape)

(1935, 1)
(1587, 1)
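
The same reindex-then-dropna pattern can be tried without the data files. The sketch below uses tiny stand-ins for names_1881 and names_1981; the names and counts are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Made-up stand-in for names_1881 (counts are hypothetical)
names_1881 = pd.DataFrame(
    {'count': [100, 200, 5]},
    index=pd.MultiIndex.from_tuples(
        [('Mary', 'F'), ('John', 'M'), ('Zilpha', 'F')],
        names=['name', 'gender'],
    ),
)

# Made-up stand-in for names_1981 (counts are hypothetical)
names_1981 = pd.DataFrame(
    {'count': [300, 400, 500]},
    index=pd.MultiIndex.from_tuples(
        [('Mary', 'F'), ('John', 'M'), ('Jennifer', 'F')],
        names=['name', 'gender'],
    ),
)

# Keep the 1881 labels, filling in 1981 counts where they exist
common_names = names_1981.reindex(names_1881.index)
print(common_names.shape)

# Rows for (name, gender) pairs absent in 1981 are NaN; drop them
common_names = common_names.dropna()
print(common_names.shape)
```

Reindexing matches on the full (name, gender) pair, so a name counts as common only if it appears with the same gender in both years.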


## Arithmetic with Series & DataFrames

In [13]:
# Loading weather data
weather = pd.read_csv('data/pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.loc['2013-7-1':'2013-7-7', 'PrecipitationIn']

Date
2013-07-01    0.18
2013-07-02    0.14
2013-07-03    0.00
2013-07-04    0.25
2013-07-05    0.02
2013-07-06    0.06
2013-07-07    0.10
Name: PrecipitationIn, dtype: float64

In [14]:
# Scalar multiplication
weather.loc['2013-07-01':'2013-07-07', 'PrecipitationIn'] * 2.54

Date
2013-07-01    0.4572
2013-07-02    0.3556
2013-07-03    0.0000
2013-07-04    0.6350
2013-07-05    0.0508
2013-07-06    0.1524
2013-07-07    0.2540
Name: PrecipitationIn, dtype: float64

In [15]:
# Absolute temperature range
week1_range = weather.loc['2013-07-01':'2013-07-07',['Min TemperatureF', 'Max TemperatureF']]
print(week1_range)

            Min TemperatureF  Max TemperatureF
Date                                          
2013-07-01                66                79
2013-07-02                66                84
2013-07-03                71                86
2013-07-04                70                86
2013-07-05                69                86
2013-07-06                70                89
2013-07-07                70                77


In [16]:
# Average temperature
week1_mean = weather.loc['2013-07-01':'2013-07-07', 'Mean TemperatureF']
print(week1_mean)

Date
2013-07-01    72
2013-07-02    74
2013-07-03    78
2013-07-04    77
2013-07-05    76
2013-07-06    78
2013-07-07    72
Name: Mean TemperatureF, dtype: int64


In [17]:
# Relative temperature range
week1_range / week1_mean

            2013-07-01 00:00:00  2013-07-02 00:00:00  2013-07-03 00:00:00  2013-07-04 00:00:00  2013-07-05 00:00:00  2013-07-06 00:00:00  2013-07-07 00:00:00  Min TemperatureF  Max TemperatureF
Date
2013-07-01                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN               NaN               NaN
2013-07-02                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN               NaN               NaN
2013-07-03                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN               NaN               NaN
2013-07-04                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN               NaN               NaN
2013-07-05                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN               NaN               NaN
2013-07-06                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN               NaN               NaN
2013-07-07                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN                  NaN               NaN               NaN


In [18]:
week1_range.divide(week1_mean, axis='rows')

            Min TemperatureF  Max TemperatureF
Date
2013-07-01          0.916667          1.097222
2013-07-02          0.891892          1.135135
2013-07-03          0.910256          1.102564
2013-07-04          0.909091          1.116883
2013-07-05          0.907895          1.131579
2013-07-06          0.897436          1.141026
2013-07-07          0.972222          1.069444
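
The difference between the two results comes down to alignment. The `/` operator matches the Series index (dates) against the DataFrame's column labels, finds no overlap, and fills everything with NaN; `.divide(..., axis='rows')` matches the Series index against the DataFrame's row index instead. A minimal sketch using the first three days of the data above, built inline rather than read from the CSV file:

```python
import pandas as pd

dates = pd.to_datetime(['2013-07-01', '2013-07-02', '2013-07-03'])

week1_range = pd.DataFrame(
    {'Min TemperatureF': [66, 66, 71], 'Max TemperatureF': [79, 84, 86]},
    index=dates,
)
week1_mean = pd.Series([72, 74, 78], index=dates, name='Mean TemperatureF')

# Series index (dates) is matched against the column labels: no overlap
mismatched = week1_range / week1_mean

# Series index (dates) is matched against the row index: one divisor per row
relative = week1_range.divide(week1_mean, axis='rows')

print(mismatched.isna().all().all())
print(relative)
```

Depending on the pandas version, the first division may also emit a warning about sorting mixed label types.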


In [19]:
# Percentage changes
week1_mean.pct_change() * 100

Date
2013-07-01         NaN
2013-07-02    2.777778
2013-07-03    5.405405
2013-07-04   -1.282051
2013-07-05   -1.298701
2013-07-06    2.631579
2013-07-07   -7.692308
Name: Mean TemperatureF, dtype: float64

In [20]:
# Bronze Olympic medals
bronze = pd.read_csv('data/summer_Olympic_medals/bronze_top5.csv', index_col=0)
print(bronze) 

                 Total
Country               
United States   1052.0
Soviet Union     584.0
United Kingdom   505.0
France           475.0
Germany          454.0


In [21]:
# Silver Olympic medals
silver = pd.read_csv('data/summer_Olympic_medals/silver_top5.csv', index_col=0)
print(silver)

                 Total
Country               
United States   1195.0
Soviet Union     627.0
United Kingdom   591.0
France           461.0
Italy            394.0


In [22]:
# Gold Olympic medals
gold = pd.read_csv('data/summer_Olympic_medals/gold_top5.csv', index_col=0)
print(gold)

                 Total
Country               
United States   2088.0
Soviet Union     838.0
United Kingdom   498.0
Italy            460.0
Germany          407.0


In [23]:
# Adding bronze, silver
bronze + silver # gives NaN in certain rows

                 Total
Country
France           936.0
Germany            NaN
Italy              NaN
Soviet Union    1211.0
United Kingdom  1096.0
United States   2247.0


In [24]:
bronze.add(silver) # still gives NaN in certain rows

                 Total
Country
France           936.0
Germany            NaN
Italy              NaN
Soviet Union    1211.0
United Kingdom  1096.0
United States   2247.0


In [25]:
# Using a fill_value
bronze.add(silver, fill_value=0)

                 Total
Country
France           936.0
Germany          454.0
Italy            394.0
Soviet Union    1211.0
United Kingdom  1096.0
United States   2247.0


In [26]:
# Chaining .add()
bronze.add(silver, fill_value=0).add(gold, fill_value=0)

                 Total
Country
France           936.0
Germany          861.0
Italy            854.0
Soviet Union    2049.0
United Kingdom  1594.0
United States   4335.0
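
The medal arithmetic can be reproduced without the CSV files. The sketch below rebuilds the three top-5 tables from the values printed above; `medal_table` is a hypothetical helper written for this example, not part of pandas.

```python
import pandas as pd

def medal_table(totals):
    """Hypothetical helper: one 'Total' column indexed by country."""
    return pd.Series(totals, name='Total').rename_axis('Country').to_frame()

bronze = medal_table({'United States': 1052.0, 'Soviet Union': 584.0,
                      'United Kingdom': 505.0, 'France': 475.0, 'Germany': 454.0})
silver = medal_table({'United States': 1195.0, 'Soviet Union': 627.0,
                      'United Kingdom': 591.0, 'France': 461.0, 'Italy': 394.0})
gold = medal_table({'United States': 2088.0, 'Soviet Union': 838.0,
                    'United Kingdom': 498.0, 'Italy': 460.0, 'Germany': 407.0})

# Countries present in only one table become NaN
plain = bronze + silver

# fill_value=0 treats a country missing from one table as having zero medals
total = bronze.add(silver, fill_value=0).add(gold, fill_value=0)

print(plain)
print(total)
```

The result covers the union of the three indexes, which is why six countries appear even though each input has five rows.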


### Broadcasting in arithmetic formulas

In this exercise, you'll work with weather data pulled from wunderground.com. The DataFrame weather has been pre-loaded along with pandas as pd. It has 365 rows (observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.

You'll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.

Remember, ordinary arithmetic operators (like +, -, *, and /) broadcast scalar values to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions. Broadcasting also works with pandas Series and NumPy arrays.

In [27]:
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = ['Min TemperatureC', 'Mean TemperatureC', 'Max TemperatureC']

# Print first 5 rows of temps_c
print(temps_c.head())

            Min TemperatureC  Mean TemperatureC  Max TemperatureC
Date                                                             
2013-01-01         -6.111111          -2.222222          0.000000
2013-01-02         -8.333333          -6.111111         -3.888889
2013-01-03         -8.888889          -4.444444          0.000000
2013-01-04         -2.777778          -2.222222         -1.111111
2013-01-05         -3.888889          -1.111111          1.111111
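
As a self-contained sketch of the same conversion, the example below builds a two-row temps_f inline. The Fahrenheit values are back-computed from the Celsius output above, which is an assumption about the underlying data. It also renames the columns with `.str.replace()` instead of typing the new labels by hand; that is an alternative shown for illustration, not the approach used in the exercise.

```python
import pandas as pd

# Fahrenheit values inferred from the first two rows of the output above
temps_f = pd.DataFrame(
    {
        'Min TemperatureF': [21, 17],
        'Mean TemperatureF': [28, 21],
        'Max TemperatureF': [32, 25],
    },
    index=pd.to_datetime(['2013-01-01', '2013-01-02']),
)

# The scalars 32, 5 and 9 are broadcast to every cell of the DataFrame
temps_c = (temps_f - 32) * 5 / 9

# Swap the unit suffix in every column label
temps_c.columns = temps_c.columns.str.replace('F', 'C')

print(temps_c)
```

The string replacement is safe here only because 'F' occurs once in each label; labels containing another capital F would need a more specific pattern.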


### Computing percentage growth of GDP

Your job in this exercise is to compute the yearly percent-change of US GDP (Gross Domestic Product) since 2008.

The data has been obtained from the Federal Reserve Bank of St. Louis and is available in the file GDP.csv, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP.

In [28]:
# Read the GDP data (stored here as 'gdp_usa.csv') into a DataFrame: gdp
gdp = pd.read_csv('data/GDP/gdp_usa.csv', parse_dates=True, index_col='DATE')

# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]

# Print the last 8 rows of post2008
print(post2008.tail(8))

# Resample post2008 by year, keeping last(): yearly
# ('A' is the year-end alias; pandas 2.2 and later spell it 'YE')
yearly = post2008.resample('A').last()

# Print yearly
print(yearly)

# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly.pct_change() * 100

# Print yearly again
print(yearly)

              VALUE
DATE               
2014-07-01  17569.4
2014-10-01  17692.2
2015-01-01  17783.6
2015-04-01  17998.3
2015-07-01  18141.9
2015-10-01  18222.8
2016-01-01  18281.6
2016-04-01  18436.5
              VALUE
DATE               
2008-12-31  14549.9
2009-12-31  14566.5
2010-12-31  15230.2
2011-12-31  15785.3
2012-12-31  16297.3
2013-12-31  16999.9
2014-12-31  17692.2
2015-12-31  18222.8
2016-12-31  18436.5
              VALUE    growth
DATE                         
2008-12-31  14549.9       NaN
2009-12-31  14566.5  0.114090
2010-12-31  15230.2  4.556345
2011-12-31  15785.3  3.644732
2012-12-31  16297.3  3.243524
2013-12-31  16999.9  4.311144
2014-12-31  17692.2  4.072377
2015-12-31  18222.8  2.999062
2016-12-31  18436.5  1.172707
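
To make the growth column less of a black box: `.pct_change()` compares each row with the one before it, computing x[t] / x[t-1] - 1. The sketch below checks this on the first three year-end values printed above, writing the same formula out with `.shift()`.

```python
import pandas as pd

# First three year-end GDP values from the output above
yearly = pd.DataFrame(
    {'VALUE': [14549.9, 14566.5, 15230.2]},
    index=pd.to_datetime(['2008-12-31', '2009-12-31', '2010-12-31']),
)

# Built-in percentage change, scaled to percent
growth = yearly['VALUE'].pct_change() * 100

# The same computation written out: divide by the previous row, subtract 1
manual = (yearly['VALUE'] / yearly['VALUE'].shift(1) - 1) * 100

print(growth)
print(manual)
```

The first row is NaN in both versions because there is no earlier value to compare against.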


### Converting currency of stocks

In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from Yahoo Finance. The files sp500.csv for sp500 and exchange.csv for the exchange rates are both provided to you.

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

In [29]:
# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv('data/sp500.csv', parse_dates=True, index_col='Date')

# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv('data/exchange.csv', parse_dates=True, index_col='Date')

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]

# Print the head of dollars
print(dollars.head())

# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')

# Print the head of pounds
print(pounds.head())

                   Open        Close
Date                                
2015-01-02  2058.899902  2058.199951
2015-01-05  2054.439941  2020.579956
2015-01-06  2022.150024  2002.609985
2015-01-07  2005.550049  2025.900024
2015-01-08  2030.609985  2062.139893
                   Open        Close
Date                                
2015-01-02  1340.364425  1339.908750
2015-01-05  1348.616555  1326.389506
2015-01-06  1332.515980  1319.639876
2015-01-07  1330.562125  1344.063112
2015-01-08  1343.268811  1364.126161
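
Because `.multiply(..., axis='rows')` aligns on the date index, a date missing from the exchange-rate Series yields NaN rather than an error. The sketch below illustrates this with made-up prices and hypothetical GBP/USD rates; none of the numbers come from sp500.csv or exchange.csv.

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06'])

# Made-up dollar prices
dollars = pd.DataFrame(
    {'Open': [100.0, 110.0, 120.0], 'Close': [105.0, 108.0, 125.0]},
    index=dates,
)

# Hypothetical rates; the last trading day is deliberately missing
rates = pd.Series([0.65, 0.66], index=dates[:2], name='GBP/USD')

# Each row of dollars is scaled by the rate with the same date
pounds = dollars.multiply(rates, axis='rows')

print(pounds)
```

Checking the result for NaN after such a conversion is a cheap way to spot gaps in the exchange-rate data.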


## Appending & Concatenating Series

**append()**
* .append(): Series & DataFrame method
* Invocation:
    * s1.append(s2)
* Stacks rows of s2 below s1
* Deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat() in newer versions

**concat()**
* concat(): pandas module function
* Invocation:
    * pd.concat([s1, s2, s3])
* Can stack row-wise or column-wise

**concat() & .append()**
* Equivalence of concat() & .append():
    * result1 = pd.concat([s1, s2, s3])
    * result2 = s1.append(s2).append(s3)
* result1 == result2 elementwise
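
The equivalence can be checked directly. Because the `.append()` method was removed in pandas 2.0, the sketch below expresses the chained form with nested `pd.concat()` calls so that it runs on both older and newer versions; the three short Series are made up for the example.

```python
import pandas as pd

s1 = pd.Series(['CT', 'ME'])
s2 = pd.Series(['DE', 'FL'])
s3 = pd.Series(['IL', 'IN'])

# One call stacks all three Series
result1 = pd.concat([s1, s2, s3])

# Pairwise stacking, which is what chained .append() calls amounted to
result2 = pd.concat([pd.concat([s1, s2]), s3])

# Same values and same (repeated) index labels
print(result1.equals(result2))
```

Passing the whole list to a single `pd.concat()` call avoids building an intermediate Series at every step.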

In [30]:
northeast = pd.Series(['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'])
south = pd.Series(['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'])
midwest = pd.Series(['IL', 'IN', 'MN', 'MO', 'NE', 'ND','SD', 'IA', 'KS', 'MI', 'OH', 'WI'])
west = pd.Series(['AZ', 'CO', 'ID', 'MT', 'NV', 'NM','UT', 'WY', 'AK', 'CA', 'HI', 'OR','WA'])

In [31]:
# Using .append()
east = northeast.append(south)
print(east) 

0     CT
1     ME
2     MA
3     NH
4     RI
5     VT
6     NJ
7     NY
8     PA
0     DE
1     FL
2     GA
3     MD
4     NC
5     SC
6     VA
7     DC
8     WV
9     AL
10    KY
11    MS
12    TN
13    AR
14    LA
15    OK
16    TX
dtype: object


In [32]:
# The appended Index contains duplicate labels, so .loc[3] returns two rows
print(east.index)
print(east.loc[3]) 

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  0,  1,  2,  3,  4,  5,  6,  7,
             8,  9, 10, 11, 12, 13, 14, 15, 16],
           dtype='int64')
3    NH
3    MD
dtype: object


In [33]:
# Using .reset_index()
new_east = northeast.append(south).reset_index(drop=True)
print(new_east.head(11)) 

0     CT
1     ME
2     MA
3     NH
4     RI
5     VT
6     NJ
7     NY
8     PA
9     DE
10    FL
dtype: object


In [34]:
print(new_east.index)

RangeIndex(start=0, stop=26, step=1)


In [35]:
# Using concat()
east = pd.concat([northeast, south])
print(east.head(11))

0    CT
1    ME
2    MA
3    NH
4    RI
5    VT
6    NJ
7    NY
8    PA
0    DE
1    FL
dtype: object


In [36]:
print(east.index)

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  0,  1,  2,  3,  4,  5,  6,  7,
             8,  9, 10, 11, 12, 13, 14, 15, 16],
           dtype='int64')


In [37]:
# Using ignore_index
new_east = pd.concat([northeast, south],ignore_index=True)
print(new_east.head(11))

0     CT
1     ME
2     MA
3     NH
4     RI
5     VT
6     NJ
7     NY
8     PA
9     DE
10    FL
dtype: object


In [38]:
print(new_east.index)

RangeIndex(start=0, stop=26, step=1)
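
Besides `ignore_index`, `pd.concat()` also accepts `verify_integrity=True`, which raises an error instead of silently producing duplicate labels. A minimal sketch with shortened versions of the region Series:

```python
import pandas as pd

northeast = pd.Series(['CT', 'ME', 'MA'])
south = pd.Series(['DE', 'FL'])

# Default behaviour keeps each Series' own labels, so 0 and 1 appear twice
east = pd.concat([northeast, south])
print(east.index.is_unique)

# verify_integrity=True turns duplicate labels into a ValueError
try:
    pd.concat([northeast, south], verify_integrity=True)
    raised = False
except ValueError:
    raised = True
print(raised)

# ignore_index=True discards the old labels and numbers the rows from 0
new_east = pd.concat([northeast, south], ignore_index=True)
print(list(new_east.index))
```

Which option fits depends on whether the original labels carry meaning: use `ignore_index` when they do not, and `verify_integrity` when duplicates would indicate a data problem.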


### Appending pandas Series

In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the 'Units' column from each and append them together with method chaining using .append().

To check that the stacking worked, you'll print slices from the combined Series, and finally you'll sum it to find the total units sold in the first quarter.

In [39]:
# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv('data/Sales/sales-jan-2015.csv', parse_dates=True, index_col='Date')

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv('data/Sales/sales-feb-2015.csv', parse_dates=True, index_col='Date')

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv('data/Sales/sales-mar-2015.csv', parse_dates=True, index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
print(quarter1.sum())

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
642


### Concatenating pandas Series along row axis

Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. This time, the DataFrames jan, feb, and mar have been pre-loaded.

Your job is to use pd.concat() with a list of Series to achieve the same result that you would get by chaining calls to .append().

You may be wondering about the difference between pd.concat() and pandas' .append() method. One way to think of the difference is that .append() is a specific case of a concatenation, while pd.concat() gives you more flexibility, as you'll see in later exercises.

In [40]:
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')

# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
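
The loop that builds the list can also be written as a comprehension. The sketch below shows this with small made-up stand-ins for jan, feb, and mar (timestamps and unit counts are hypothetical). Sorting the index before slicing is an extra step taken here because label-based slicing is most predictable on a sorted index.

```python
import pandas as pd

# Made-up monthly sales tables with a datetime index
jan = pd.DataFrame({'Units': [4, 18]},
                   index=pd.to_datetime(['2015-01-10 09:00:00', '2015-01-27 07:11:55']))
feb = pd.DataFrame({'Units': [3, 4]},
                   index=pd.to_datetime(['2015-02-02 08:33:01', '2015-02-26 08:57:45']))
mar = pd.DataFrame({'Units': [17]},
                   index=pd.to_datetime(['2015-03-06 10:11:45']))

# Extract 'Units' from each month and stack the Series row-wise
quarter1 = pd.concat([month['Units'] for month in (jan, feb, mar)]).sort_index()

# Slice across the January/February boundary
print(quarter1.loc['2015-01-27':'2015-02-02'])

# Total units for the quarter
print(quarter1.sum())
```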


## Appending & concatenating DataFrames

```python
In [1]: import pandas as pd
In [2]: pop1 = pd.read_csv('population_01.csv', index_col=0)
In [3]: pop2 = pd.read_csv('population_02.csv', index_col=0)
In [4]: print(type(pop1), pop1.shape)
<class 'pandas.core.frame.DataFrame'> (4, 1)
In [5]: print(type(pop2), pop2.shape)
<class 'pandas.core.frame.DataFrame'> (4, 1)

# Examining population data
In [6]: print(pop1)
      2010 Census Population
Zip Code ZCTA
66407 479
72732 4716
50579 2405
46241 30670
In [7]: print(pop2)
      2010 Census Population
Zip Code ZCTA
12776 2180
76092 26669
98360 12221
49464 27481

# Appending population DataFrames
In [8]: pop1.append(pop2)
Out[8]:
      2010 Census Population
Zip Code ZCTA
66407 479
72732 4716
50579 2405
46241 30670
12776 2180
76092 26669
98360 12221
49464 27481
In [9]: print(pop1.index.name, pop1.columns)
Zip Code ZCTA Index(['2010 Census Population'], dtype='object')
In [10]: print(pop2.index.name, pop2.columns)
Zip Code ZCTA Index(['2010 Census Population'], dtype='object')

# Population & unemployment data
In [11]: population = pd.read_csv('population_00.csv', index_col=0)
In [12]: unemployment = pd.read_csv('unemployment_00.csv', index_col=0)
In [13]: print(population)
      2010 Census Population
Zip Code ZCTA
57538 322
59916 130
37660 40038
2860 45199
In [14]: print(unemployment)
     unemployment participants
Zip
2860 0.11 34447
46167 0.02 4800
1097 0.33 42
80808 0.07 4310

# Appending population & unemployment
In [15]: population.append(unemployment)
Out[15]:
 2010 Census Population participants unemployment
57538 322.0 NaN NaN
59916 130.0 NaN NaN
37660 40038.0 NaN NaN
2860 45199.0 NaN NaN
2860 NaN 34447.0 0.11
46167 NaN 4800.0 0.02
1097 NaN 42.0 0.33
80808 NaN 4310.0 0.07

# The result has many NaNs, and the index label 2860 is repeated

# Concatenating rows
In [16]: pd.concat([population, unemployment], axis=0)
Out[16]:
 2010 Census Population participants unemployment
57538 322.0 NaN NaN
59916 130.0 NaN NaN
37660 40038.0 NaN NaN
2860 45199.0 NaN NaN
2860 NaN 34447.0 0.11
46167 NaN 4800.0 0.02
1097 NaN 42.0 0.33
80808 NaN 4310.0 0.07

# Concatenating columns
In [17]: pd.concat([population, unemployment], axis=1)
Out[17]:
 2010 Census Population unemployment participants
1097 NaN 0.33 42.0
2860 45199.0 0.11 34447.0
37660 40038.0 NaN NaN
46167 NaN 0.02 4800.0
57538 322.0 NaN NaN
59916 130.0 NaN NaN
80808 NaN 0.07 4310.0
```
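
The transcript above can be reproduced without the CSV files by building the two DataFrames inline from the printed values. The sketch also adds `join='inner'`, a `pd.concat()` option not used above, to keep only the ZIP codes present in both tables.

```python
import pandas as pd

population = pd.DataFrame(
    {'2010 Census Population': [322, 130, 40038, 45199]},
    index=pd.Index([57538, 59916, 37660, 2860], name='Zip Code ZCTA'),
)
unemployment = pd.DataFrame(
    {'unemployment': [0.11, 0.02, 0.33, 0.07],
     'participants': [34447, 4800, 42, 4310]},
    index=pd.Index([2860, 46167, 1097, 80808], name='Zip'),
)

# axis=0 stacks rows: 4 + 4 rows, union of the columns
stacked = pd.concat([population, unemployment], axis=0)

# axis=1 places columns side by side: union of the 7 distinct ZIP codes
side_by_side = pd.concat([population, unemployment], axis=1)

# join='inner' keeps only ZIP codes found in both tables
overlap = pd.concat([population, unemployment], axis=1, join='inner')

print(stacked.shape, side_by_side.shape, overlap.shape)
print(overlap)
```

Only ZIP code 2860 appears in both inputs, so the inner join returns a single fully populated row.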

