# Combining Multiple Files of the Same Type

This code is helpful when you want to merge many files into one dataframe, provided that they all follow the same naming pattern and are all the same file type.

This is the code that we used to merge 63 `.geo.tsv` files into a single dataframe with information about 200+ million Reddit posts, which served as the base dataset that we merged parts of our other datasets into. We referenced this sample code from GitHub: https://github.com/ekapope/Combine-CSV-files-in-the-folder/blob/master/Combine_CSVs.py.

For this data manipulation demo, because of data privacy, we created 4 subset dataframes with 25 observations each, with fake usernames replacing the values in the 'author' column. In this demo we merge these 4 files into a combined file with 100 observations that will be used later.

## Import Libraries

Here we import the libraries needed to run the code.

In [1]:
import os
import glob
import pandas as pd

## Combine Files

Because all of the files share the same format, we can combine them into one file. We found that saving the result as a `.parquet` file made the data convenient to load later.

The `os.getcwd()` function returns the current working directory, which should be the `data_wrangling` folder. The data files live elsewhere, so we replace `data_wrangling` in that path with `synthetic_data/combine_files` and change into the resulting directory. Once that is done, we can go ahead and merge all of the files into one.

In [2]:
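# Point DATA_DIR at the folder that holds the files to combine
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('data_wrangling', 'synthetic_data/combine_files')

os.chdir(DATA_DIR)
extension = 'parquet'
# Sort the matches so the files are always combined in the same order
all_filenames = sorted(glob.glob('*_synthetic.{}'.format(extension)))

# Read every file and stack the rows; ignore_index gives the result a fresh 0..n-1 index
combined_file = pd.concat([pd.read_parquet(f) for f in all_filenames], ignore_index=True)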
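
The path handling and file matching in this step can also be sketched with the standard library alone. The version below is only an illustration: it rebuilds the `data_wrangling` and `synthetic_data/combine_files` layout inside a temporary folder, and the placeholder file names (`a_synthetic.parquet`, `notes.txt`, and so on) are assumptions made up for the example. It shows one possible alternative to string replacement plus `os.chdir`, namely building the data path from the parent folder and passing a full path pattern to `glob.glob`.

```python
import glob
import os
import tempfile
from pathlib import Path

# Hypothetical project layout, built in a temp folder for illustration only
with tempfile.TemporaryDirectory() as root:
    notebook_dir = Path(root) / "data_wrangling"
    data_dir = Path(root) / "synthetic_data" / "combine_files"
    notebook_dir.mkdir(parents=True)
    data_dir.mkdir(parents=True)

    # Empty placeholder files: three match the naming pattern, one does not
    for name in ["b_synthetic.parquet", "a_synthetic.parquet",
                 "c_synthetic.parquet", "notes.txt"]:
        (data_dir / name).touch()

    # Build the data path from the notebook folder's parent
    resolved = notebook_dir.parent / "synthetic_data" / "combine_files"

    # glob accepts a full path pattern, so the working directory stays unchanged
    pattern = str(resolved / "*_synthetic.parquet")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```

Only the three files ending in `_synthetic.parquet` are matched, and sorting makes the order repeatable across runs.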

In [4]:
combined_file.to_parquet("combined_posts_file.parquet")
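
A quick sanity check on a combined dataframe might look like the sketch below. It does not read the demo files; instead it builds 4 in-memory dataframes of 25 rows each as stand-ins, and the `score` column and the `user_…` values are hypothetical. It passes `ignore_index=True` to `pd.concat` so that the row labels are unique, then confirms the expected 100 rows.

```python
import pandas as pd

# Hypothetical stand-ins for the 4 demo files: 25 rows each, made-up values
frames = [
    pd.DataFrame({
        "author": [f"user_{part}_{i}" for i in range(25)],
        "score": list(range(25)),
    })
    for part in range(4)
]

# Stack the rows and renumber the index from 0
combined = pd.concat(frames, ignore_index=True)

# 4 files x 25 rows should give 100 rows
assert len(combined) == 100
# Every input should contribute the same columns
assert list(combined.columns) == ["author", "score"]
# ignore_index=True yields unique row labels
assert combined.index.is_unique
```

If a row count comes out lower than expected on real data, a likely first thing to check is whether the glob pattern matched every file.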