# Data Science with pandas (part 2)

## Combining data frames

In many real-life cases, data are saved in different files, so you may need to deal with several pandas DataFrames. In the previous session, we saw how we can easily run statistical analysis on a single DataFrame; ideally, therefore, we would like to have all the data relevant to our analysis inside a single DataFrame. <br>
In this session we will explore different ways of combining DataFrames into a single DataFrame.

Let's start by loading the pandas library, reading two data sets (```surveys.csv``` and ```species.csv```) into pandas DataFrames, and having a quick look at the tabular data:

In [1]:
import pandas as pd

In [2]:
surveys_df = pd.read_csv("../data/surveys.csv", keep_default_na=False, na_values=[""])
species_df = pd.read_csv("../data/species.csv", keep_default_na=False, na_values=[""])

In [37]:
print(surveys_df.info())
print('='*72)
surveys_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   record_id        35549 non-null  int64  
 1   month            35549 non-null  int64  
 2   day              35549 non-null  int64  
 3   year             35549 non-null  int64  
 4   plot_id          35549 non-null  int64  
 5   species_id       34786 non-null  object 
 6   sex              33038 non-null  object 
 7   hindfoot_length  31438 non-null  float64
 8   weight           32283 non-null  float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB
None


Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,


In [38]:
print(species_df.info())
print('='*72)
species_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   species_id  54 non-null     object
 1   genus       54 non-null     object
 2   species     54 non-null     object
 3   taxa        54 non-null     object
dtypes: object(4)
memory usage: 1.8+ KB
None


Unnamed: 0,species_id,genus,species,taxa
0,AB,Amphispiza,bilineata,Bird
1,AH,Ammospermophilus,harrisi,Rodent
2,AS,Ammodramus,savannarum,Bird
3,BA,Baiomys,taylori,Rodent
4,CB,Campylorhynchus,brunneicapillus,Bird


### Concatenating DataFrames

The first way we will combine DataFrames is **concatenation**, i.e. simply putting DataFrames one after the other either **vertically** or **horizontally**. To concatenate two DataFrames you will use the function ```pd.concat```, passing the DataFrames to concatenate as a list and specifying ```axis=0``` or ```axis=1``` for vertical or horizontal concatenation, respectively.
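As a quick illustration of the two values of ```axis```, here is a minimal sketch on two tiny DataFrames. The names ```top``` and ```bottom``` and their values are made up for this example and are not part of the survey data.

```python
import pandas as pd

# Two small, made-up DataFrames with the same columns and the same index
top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# axis=0: the rows of `bottom` are placed below the rows of `top`
stacked_rows = pd.concat([top, bottom], axis=0)
print(stacked_rows.shape)  # (4, 2)

# axis=1: the columns of `bottom` are placed to the right of the columns of `top`
stacked_cols = pd.concat([top, bottom], axis=1)
print(stacked_cols.shape)  # (2, 4)
```

Vertical concatenation increases the number of rows, while horizontal concatenation increases the number of columns.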

To play a bit with DataFrame concatenation, we will use subsets of the DataFrames we just read. In particular, we will work with two sub-DataFrames obtained by selecting the first and the last 10 rows of the ```surveys.csv``` dataset.

In [43]:
# Subsetting data frames
surveys_df_sub_first10 = surveys_df.head(10)
surveys_df_sub_last10  = surveys_df.tail(10)

Let's start with vertical stacking. In this case the two DataFrames are simply stacked on top of each other (remember to specify ```axis=0```).
<div>
<img src="pictures/vertical_stacking.jpeg" width="300"/>
</div>

In [44]:
# Stack the DataFrames on top of each other
vertical_stack = pd.concat([surveys_df_sub_first10, surveys_df_sub_last10], axis=0)

In [46]:
print(vertical_stack.info())
vertical_stack

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 35548
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   record_id        20 non-null     int64  
 1   month            20 non-null     int64  
 2   day              20 non-null     int64  
 3   year             20 non-null     int64  
 4   plot_id          20 non-null     int64  
 5   species_id       19 non-null     object 
 6   sex              16 non-null     object 
 7   hindfoot_length  15 non-null     float64
 8   weight           6 non-null      float64
dtypes: float64(2), int64(5), object(2)
memory usage: 1.6+ KB
None


Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,
5,6,7,16,1977,1,PF,M,14.0,
6,7,7,16,1977,2,PE,F,,
7,8,7,16,1977,1,DM,M,37.0,
8,9,7,16,1977,1,DM,F,34.0,
9,10,7,16,1977,6,PF,F,20.0,


The resulting DataFrame (```vertical_stack```) consists, as expected, of 20 rows: the first and the last 10 rows of our original DataFrame ```surveys_df```. You may have noticed that the last ten rows have very high index values, not consecutive with those of the first ten rows. This is because concatenation preserves the indices of the two original DataFrames. If you want a brand new set of indices for your concatenated DataFrame, simply reset the indices using the method ```.reset_index()```. Note that this method returns a new DataFrame and, by default, keeps the old indices in a new column called ```index```.

In [47]:
vertical_stack.reset_index()

Unnamed: 0,index,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,0,1,7,16,1977,2,NL,M,32.0,
1,1,2,7,16,1977,3,NL,M,33.0,
2,2,3,7,16,1977,2,DM,F,37.0,
3,3,4,7,16,1977,7,DM,M,36.0,
4,4,5,7,16,1977,3,DM,M,35.0,
5,5,6,7,16,1977,1,PF,M,14.0,
6,6,7,7,16,1977,2,PE,F,,
7,7,8,7,16,1977,1,DM,M,37.0,
8,8,9,7,16,1977,1,DM,F,34.0,
9,9,10,7,16,1977,6,PF,F,20.0,
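If you do not need the old indices, there are two common ways to avoid the extra ```index``` column. The sketch below uses small made-up DataFrames (```first``` and ```last``` are illustrative names, not variables defined earlier) to show ```reset_index(drop=True)``` and the ```ignore_index=True``` argument of ```pd.concat```.

```python
import pandas as pd

# Toy stand-ins for the first and last rows of a larger table (made-up values)
first = pd.DataFrame({"record_id": [1, 2]}, index=[0, 1])
last = pd.DataFrame({"record_id": [99, 100]}, index=[98, 99])

# Concatenation keeps the original indices
stacked = pd.concat([first, last], axis=0)
print(list(stacked.index))  # [0, 1, 98, 99]

# drop=True discards the old indices instead of storing them in an "index" column
renumbered = stacked.reset_index(drop=True)
print(list(renumbered.index))    # [0, 1, 2, 3]
print(list(renumbered.columns))  # ['record_id']

# ignore_index=True renumbers the rows during the concatenation itself
renumbered_directly = pd.concat([first, last], axis=0, ignore_index=True)
print(list(renumbered_directly.index))  # [0, 1, 2, 3]
```

Both approaches give the same result here; which one to use depends on whether you still need the DataFrame with its original indices.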


<div class="alert alert-block alert-warning">
<b>TRY IT YOURSELF</b>: In the given example of vertical concatenation, you concatenated two DataFrames with the same columns. What would happen if the two DataFrames to concatenate had a different number of columns and different column names?
    <ol>
        <li>Create a new DataFrame using the last 10 rows of the species DataFrame</li>
        <li>Concatenate vertically ```surveys_df_sub_first10``` and the DataFrame you just created</li>
        <li>Print the concatenated DataFrame info on the screen. How many rows does it have? What happened to the columns? Can you tell, finally, what happens when you concatenate two DataFrames with different columns?</li>
    </ol>
</div>