# DataFrame concatenation

You'll often want to stack together DataFrames that share the same columns. This comes up for me when I want to take data from multiple files and put it all together into one file, or just analyze or visualize it together in a single DataFrame.

We'll get to this functionality, but along the way we'll learn how to:

- get a random sample of DataFrame rows
- drop (delete) a column
- drop duplicate rows

In [1]:
import pandas as pd

## Basic concatenation

To show the basic principle we're just going to put a DataFrame together with some random samples from itself.

In [2]:
df = pd.read_csv('data/women_percent_deg_usa_subset.csv')
df.head()

Unnamed: 0,Year,Agriculture,Business,Engineering,Health,Psychology
0,1970,4.229798,9.064439,0.8,77.1,44.4
1,1971,5.452797,9.503187,1.0,75.5,46.2
2,1972,7.42071,10.558962,1.2,76.9,47.6
3,1973,9.653602,12.804602,1.6,77.4,50.4
4,1974,14.074623,16.20485,2.2,77.9,52.6


### Random sampling

Sometimes a dataset is very large, and for visualization you may want to show only a random subset of it. We can use `df.sample()` and specify either the number of rows with `n=` or the fraction of the original with `frac=`.

*Notice that each time we run `.sample()` we get a different subset. Execute this line more than once and compare.*

In [3]:
df.sample(n=5)

Unnamed: 0,Year,Agriculture,Business,Engineering,Health,Psychology
2,1972,7.42071,10.558962,1.2,76.9,47.6
20,1990,32.703444,47.200851,14.1,83.9,72.6
26,1996,38.969775,48.647393,16.7,81.3,73.9
24,1994,36.032674,47.983924,15.7,81.8,72.9
6,1976,22.25276,23.430038,4.5,79.2,56.9


In [4]:
df.sample(frac=0.1)

Unnamed: 0,Year,Agriculture,Business,Engineering,Health,Psychology
28,1998,41.912403,49.258515,17.8,82.1,75.1
27,1997,40.685685,48.56105,17.0,81.9,74.4
2,1972,7.42071,10.558962,1.2,76.9,47.6
25,1995,36.844807,48.573181,16.2,81.5,73.0
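
If you need the same sample every time (for a report, say), `sample()` accepts a `random_state` seed. Here is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and are not the tutorial's CSV.

```python
import pandas as pd

# Small stand-in DataFrame (made-up values, not the tutorial's data)
toy = pd.DataFrame({"Year": range(1970, 1980), "Value": range(10)})

# Passing random_state fixes the seed, so the same rows come back each time
first = toy.sample(n=3, random_state=0)
second = toy.sample(n=3, random_state=0)
print(first.equals(second))

# frac= works with a seed as well; half of 10 rows is 5 rows
half = toy.sample(frac=0.5, random_state=0)
print(len(half))
```

Leaving `random_state` out gives the behavior described above: a different subset on each run.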


### Creating duplicates with the sample

Now, we'll create a new DataFrame that has duplicates built in by concatenating our original DataFrame with a sample of its own rows. 

*You give `pd.concat()` a list of all the DataFrames you want to concatenate.*

In [5]:
df_w_dups = pd.concat([df, df.sample(n=5)])
df_w_dups.tail(10)

Unnamed: 0,Year,Agriculture,Business,Engineering,Health,Psychology
37,2007,47.605026,49.000459,16.8,85.4,77.1
38,2008,47.570834,48.888027,16.5,85.2,77.2
39,2009,48.667224,48.840474,16.8,85.1,77.1
40,2010,48.730042,48.757988,17.2,85.0,77.0
41,2011,50.037182,48.180418,17.5,84.8,76.7
10,1980,30.75939,36.765725,10.3,83.5,65.1
18,1988,31.085087,46.764828,13.9,85.2,70.9
35,2005,47.672754,49.791851,17.9,86.0,77.5
19,1989,31.612403,46.781565,14.1,84.6,71.6
11,1981,31.318655,39.26623,11.6,84.1,66.9


## Reset index

In a minute we'll remove duplicate rows, but I want to show you that it's not detecting duplicates in the Index.

First, it's a little weird to me that the Index doesn't have to be unique. (If you notice above, some index values are repeated after the concatenation.) That evidently affects the speed of lookup – unique Index is faster – but it's allowed. 

- Sometimes, though, after a sorting operation, or something like this, you'll want to reset the index to a new range of integers. You do that with `df.reset_index()`. 
- It's also a way to move the Index to a regular column, which comes in handy sometimes, too. 
- Here I'll do it "inplace". 

**Remember to be careful with "inplace" operations, because you'll be writing over your original data!** 

(You can discard the old index instead of keeping it as a column by passing `drop=True`, but here I'll keep it to show you how to explicitly drop a column.)

In [6]:
df_w_dups.reset_index(inplace=True)
df_w_dups.tail(10)

Unnamed: 0,index,Year,Agriculture,Business,Engineering,Health,Psychology
37,37,2007,47.605026,49.000459,16.8,85.4,77.1
38,38,2008,47.570834,48.888027,16.5,85.2,77.2
39,39,2009,48.667224,48.840474,16.8,85.1,77.1
40,40,2010,48.730042,48.757988,17.2,85.0,77.0
41,41,2011,50.037182,48.180418,17.5,84.8,76.7
42,10,1980,30.75939,36.765725,10.3,83.5,65.1
43,18,1988,31.085087,46.764828,13.9,85.2,70.9
44,35,2005,47.672754,49.791851,17.9,86.0,77.5
45,19,1989,31.612403,46.781565,14.1,84.6,71.6
46,11,1981,31.318655,39.26623,11.6,84.1,66.9
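
If you never need the old labels, `pd.concat()` can renumber the rows for you through its `ignore_index` argument, and `Index.is_unique` reports whether any labels repeat. A small sketch, using a made-up `toy` DataFrame as a stand-in:

```python
import pandas as pd

# Stand-in data (made-up values)
toy = pd.DataFrame({"Year": [1970, 1971, 1972], "Value": [1.0, 2.0, 3.0]})
extra = toy.sample(n=2, random_state=0)

# Plain concat keeps each piece's original labels, so some labels repeat
stacked = pd.concat([toy, extra])
print(stacked.index.is_unique)

# ignore_index=True discards the old labels and numbers the rows from 0
renumbered = pd.concat([toy, extra], ignore_index=True)
print(renumbered.index.is_unique)
print(list(renumbered.index))
```

This gives the same row labels as concatenating and then calling `reset_index(drop=True)`.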


## Rename a column

For fun and instruction we'll rename the old "index" column before we delete it. We do that by passing `rename` the `columns=` argument with a dictionary that has the old name as the key and the new name as the value.

In [7]:
df_w_dups.rename(columns={'index':'old_index'}, inplace=True)
df_w_dups.tail(10)

Unnamed: 0,old_index,Year,Agriculture,Business,Engineering,Health,Psychology
37,37,2007,47.605026,49.000459,16.8,85.4,77.1
38,38,2008,47.570834,48.888027,16.5,85.2,77.2
39,39,2009,48.667224,48.840474,16.8,85.1,77.1
40,40,2010,48.730042,48.757988,17.2,85.0,77.0
41,41,2011,50.037182,48.180418,17.5,84.8,76.7
42,10,1980,30.75939,36.765725,10.3,83.5,65.1
43,18,1988,31.085087,46.764828,13.9,85.2,70.9
44,35,2005,47.672754,49.791851,17.9,86.0,77.5
45,19,1989,31.612403,46.781565,14.1,84.6,71.6
46,11,1981,31.318655,39.26623,11.6,84.1,66.9
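
If you'd rather not overwrite your data, you can leave out `inplace=True`: `rename` then returns a new DataFrame and the original stays as it was. A minimal sketch with a made-up `toy` DataFrame:

```python
import pandas as pd

# Stand-in data (made-up values) with a column literally named "index"
toy = pd.DataFrame({"index": [10, 18], "Year": [1980, 1988]})

# Without inplace=True, rename returns a new DataFrame and leaves toy untouched
renamed = toy.rename(columns={"index": "old_index"})

print(list(toy.columns))
print(list(renamed.columns))
```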


## Drop a column

Now we don't really need the "old_index" column, and it's good to know how to delete columns you don't want.

Here we'll use the `columns=` argument, but you can drop rows the same way with `index=` and a list of index labels.

In [8]:
df_w_dups.drop(columns='old_index', inplace=True)
df_w_dups.tail(10)

Unnamed: 0,Year,Agriculture,Business,Engineering,Health,Psychology
37,2007,47.605026,49.000459,16.8,85.4,77.1
38,2008,47.570834,48.888027,16.5,85.2,77.2
39,2009,48.667224,48.840474,16.8,85.1,77.1
40,2010,48.730042,48.757988,17.2,85.0,77.0
41,2011,50.037182,48.180418,17.5,84.8,76.7
42,1980,30.75939,36.765725,10.3,83.5,65.1
43,1988,31.085087,46.764828,13.9,85.2,70.9
44,2005,47.672754,49.791851,17.9,86.0,77.5
45,1989,31.612403,46.781565,14.1,84.6,71.6
46,1981,31.318655,39.26623,11.6,84.1,66.9
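
For comparison, here is a short sketch of dropping rows by label with `index=` next to dropping a column with `columns=`. The `toy` DataFrame is made up for illustration; neither call modifies it, because `inplace` is left out.

```python
import pandas as pd

# Stand-in data (made-up values)
toy = pd.DataFrame({"Year": [1970, 1971, 1972, 1973],
                    "Value": [0.8, 1.0, 1.2, 1.6]})

# Drop rows by index label; index= takes a single label or a list of labels
trimmed = toy.drop(index=[0, 2])

# Drop columns by name; columns= also takes a single name or a list
years_only = toy.drop(columns=["Value"])

print(list(trimmed.index))
print(list(years_only.columns))
```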


## Drop duplicate rows

Now we'll drop (delete) any duplicate rows. By default, `drop_duplicates()` compares all columns (but not the Index) and keeps the first occurrence of each row. You can specify a subset of columns to consider for duplicates with the `subset=[]` argument. Here we'll just stick with looking at all columns.

*Notice that the index only goes to 41 now: the 5 duplicate rows we added earlier (index 42 to 46) are gone.*

In [9]:
df_wo_dups = df_w_dups.drop_duplicates()
df_wo_dups.tail(10)

Unnamed: 0,Year,Agriculture,Business,Engineering,Health,Psychology
32,2002,47.134658,50.552335,18.7,85.8,77.7
33,2003,47.935187,50.345598,18.8,86.5,77.8
34,2004,47.88714,49.950894,18.2,86.5,77.8
35,2005,47.672754,49.791851,17.9,86.0,77.5
36,2006,46.7903,49.210914,16.8,85.9,77.4
37,2007,47.605026,49.000459,16.8,85.4,77.1
38,2008,47.570834,48.888027,16.5,85.2,77.2
39,2009,48.667224,48.840474,16.8,85.1,77.1
40,2010,48.730042,48.757988,17.2,85.0,77.0
41,2011,50.037182,48.180418,17.5,84.8,76.7
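
To see how `subset=` changes what counts as a duplicate, here is a small sketch on made-up data. It also uses `duplicated()`, which flags the rows that `drop_duplicates()` would remove, and the `keep=` argument, which chooses which occurrence survives.

```python
import pandas as pd

# Stand-in data (made-up values): row 2 repeats row 0 exactly,
# row 3 repeats only the Year of row 1
toy = pd.DataFrame({
    "Year": [1980, 1981, 1980, 1981],
    "Value": [10.3, 11.6, 10.3, 99.9],
})

# duplicated() marks rows that repeat an earlier row (all columns compared)
flags = toy.duplicated().tolist()
print(flags)

# Default: only the fully identical row is removed
all_cols = toy.drop_duplicates()

# subset= compares only the listed columns; keep="last" retains the final occurrence
by_year = toy.drop_duplicates(subset=["Year"], keep="last")

print(list(all_cols.index))
print(list(by_year.index))
```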


---

# Concatenate multiple files

Now we'll do the concatenation in a more typical situation – combining data from multiple files into a single DataFrame.

Remember that what we passed to `pd.concat()` was a list of DataFrames, so if we're reading data from multiple files, all we have to do is

- Get a list of the files we want to combine
- Create an empty list to hold all the DataFrames
- Loop through the files
    - Read each file into a new DataFrame
    - Add columns to specify where the data came from
    - Add the new DataFrame to a list
- Concatenate the list of DataFrames

To get the list of files, we'll use the `glob` module, which, according to 
[the glob documentation page](https://docs.python.org/3/library/glob.html), 
"finds all the pathnames matching a specified pattern according to the rules used by the Unix shell".

In [10]:
from glob import glob

In [12]:
glob('data/*baby_names*')

['data/NC_baby_names_2016.csv',
 'data/NC_baby_names_1916.csv',
 'data/MN_baby_names_2016.csv',
 'data/MN_baby_names_1916.csv']
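
The glob documentation notes that results come back in arbitrary order, so the list above may be ordered differently on another machine. If the order matters to you, wrapping the call in `sorted()` is one simple option. The sketch below creates a throwaway folder with empty placeholder files (the folder and `notes.txt` are assumptions for illustration) so that it runs anywhere:

```python
import tempfile
from glob import glob
from pathlib import Path

# Create a throwaway folder with empty files named like the tutorial's data
with tempfile.TemporaryDirectory() as tmp:
    for name in ["NC_baby_names_2016.csv", "MN_baby_names_1916.csv", "notes.txt"]:
        (Path(tmp) / name).touch()

    # sorted() makes the order repeatable; the pattern skips notes.txt
    matches = sorted(glob(str(Path(tmp) / "*baby_names*")))
    names = [Path(m).name for m in matches]

print(names)
```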

#### The whole code

In [14]:
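file_list = glob('data/*baby_names*')
df_list = []

for file in file_list:
    df = pd.read_csv(file)
    # Paths look like 'data/NC_baby_names_2016.csv', so fixed slices work here
    df['state'] = file[5:7]    # two-letter state code
    df['year'] = file[19:23]   # four-digit year (stored as a string)
    df_list.append(df)

# Stack all the DataFrames and renumber the rows from 0
df_all = pd.concat(df_list)
df_all.reset_index(drop=True, inplace=True)
df_all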

Unnamed: 0,sex,name,births,state,year
0,F,Ava,618,NC,2016
1,F,Emma,607,NC,2016
2,F,Olivia,557,NC,2016
3,F,Charlotte,407,NC,2016
4,F,Harper,404,NC,2016
...,...,...,...,...,...
75,M,Harold,428,MN,1916
76,M,James,385,MN,1916
77,M,Edward,346,MN,1916
78,M,Raymond,346,MN,1916
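
The fixed slices in the loop above only work while every path has exactly the same length and folder. As a hedged alternative, the sketch below splits the file name on underscores instead. It writes two tiny stand-in CSV files (made-up rows) into a temporary folder so that it runs on its own; the naming scheme `STATE_baby_names_YEAR.csv` is taken from the file list shown earlier.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny stand-in files (made-up rows) using the tutorial's naming scheme
    for fname in ["NC_baby_names_2016.csv", "MN_baby_names_1916.csv"]:
        pd.DataFrame({"sex": ["F"], "name": ["Ava"], "births": [1]}).to_csv(
            Path(tmp) / fname, index=False
        )

    df_list = []
    for path in sorted(Path(tmp).glob("*baby_names*")):
        df = pd.read_csv(path)
        # path.stem is the name without folder or extension, e.g. "NC_baby_names_2016"
        parts = path.stem.split("_")
        df["state"] = parts[0]
        df["year"] = int(parts[-1])  # stored as an integer rather than a string
        df_list.append(df)

# ignore_index=True renumbers the rows, so no separate reset_index step is needed
df_all = pd.concat(df_list, ignore_index=True)
print(df_all)
```

Storing the year as an integer is a design choice in this sketch; it makes sorting and numeric comparisons on `year` behave as expected.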
