# Merging Datasets

To create more insightful visualizations, it may be helpful to merge two or more datasets by common variables using the `pandas` library.

## Import Libraries

Import the libraries needed for the merge: `os` for building the path to the data files and `pandas` for loading and merging the data.

In [1]:
import os
import pandas as pd

## Read in Datasets

Read in the datasets that you want to merge. `os.getcwd()` returns the current working directory, which should be the `data_wrangling` folder. The data files live in a different folder, so we swap `data_wrangling` for `synthetic_data` in that path to get the directory that holds the data. Once that is done, we can go ahead with reading in the data and performing the necessary data manipulations.

In [2]:
DATA_DIR = os.getcwd()
DATA_DIR = DATA_DIR.replace('data_wrangling', 'synthetic_data')

combined_df = pd.read_parquet(DATA_DIR + '/combined_posts_file.parquet')

In [3]:
location_data = pd.read_parquet(DATA_DIR + '/geo_known_synthetic.parquet')

In [5]:
subred_demo = pd.read_csv(DATA_DIR + '/socdimclusters.csv')
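
The string replacement above swaps every occurrence of `data_wrangling` in the path, not just the final folder name. As a hedged alternative, the sketch below builds the same path with `pathlib`, assuming `synthetic_data` sits next to `data_wrangling` in the same parent folder. The helper name `sibling_data_dir` and the example path are hypothetical and only serve as illustration.

```python
from pathlib import Path


def sibling_data_dir(cwd, target="synthetic_data"):
    """Return the folder named `target` that sits next to `cwd`.

    Assumes the data folder is a sibling of the working directory;
    this layout is an assumption, not something the notebook states.
    """
    return Path(cwd).parent / target


# Hypothetical working directory, used only to show the result.
data_dir = sibling_data_dir("/home/user/project/data_wrangling")
posts_path = data_dir / "combined_posts_file.parquet"
```

In the notebook itself, `sibling_data_dir(os.getcwd())` would take the place of the two `DATA_DIR` lines, and the resulting path object can be passed straight to `pd.read_parquet()` or `pd.read_csv()`.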

## Extract/Rename Desired Columns

To save memory and speed up the merge, keep only the columns you need from each dataset instead of merging every column. You can also rename columns at this stage, for example so that the key column has the same name in both datasets (as with `name` and `subreddit` below).

In [7]:
timezone_loc = location_data[["author_synthetic", "timezone", "created_utc", "long", "lat"]]
timezone_loc = timezone_loc.rename(columns={'created_utc': 'loc_reveal_time'})

In [8]:
subred_cluster = subred_demo[['name', 'cluster_name']]
subred_cluster = subred_cluster.rename(columns={'name': 'subreddit'})
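
To make the select-then-rename step concrete, here is a minimal sketch on a small made-up table. The column names match the cells above, but the values and the extra `unused_column` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for location_data; all values are made up.
location_data = pd.DataFrame({
    "author_synthetic": ["user_a", "user_b"],
    "timezone": ["America/Chicago", "Europe/Paris"],
    "created_utc": [1600000000, 1600000500],
    "long": [-87.6, 2.35],
    "lat": [41.9, 48.86],
    "unused_column": ["x", "y"],
})

# Keep only the needed columns, then rename one of them.
# Both steps return new DataFrames, so location_data is left unchanged.
timezone_loc = (
    location_data[["author_synthetic", "timezone", "created_utc", "long", "lat"]]
    .rename(columns={"created_utc": "loc_reveal_time"})
)
```

Chaining the two steps is equivalent to the two-line version above; which one to use is a matter of taste.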

## Merge Datasets

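Now that the data has been prepared, use the `.merge()` method from the pandas library to combine the datasets. The key columns `subreddit` and `author_synthetic` are first cast to `str` so that the keys are plain strings when they are matched. Both merges use `how = 'outer'`, which keeps unmatched rows from either side and fills the gaps with missing values; the final `dropna(how='any')` then removes every row that still contains a missing value in any column. For more information, see the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html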

In [9]:
combined_df['subreddit'] = combined_df['subreddit'].astype(str)

In [10]:
combined_df['author_synthetic'] = combined_df['author_synthetic'].astype(str)
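
The notebook does not say why the cast is needed. One plausible reason, shown in the sketch below as an assumption, is that a key column was stored as a categorical type while the other dataset holds plain strings. The toy tables and their values are made up.

```python
import pandas as pd

# Assumption for illustration: the key arrives as a categorical column.
posts = pd.DataFrame({
    "subreddit": pd.Categorical(["aww", "python", "aww"]),
    "score": [10, 3, 7],
})
clusters = pd.DataFrame({
    "subreddit": ["aww", "python"],
    "cluster_name": ["animals", "programming"],
})

# Cast the key to plain strings so both sides hold the same kind of values.
posts["subreddit"] = posts["subreddit"].astype(str)

# A left merge keeps the order of the rows in posts.
merged = posts.merge(clusters, how="left", on="subreddit")
```

Checking `posts.dtypes` and `clusters.dtypes` before merging is a quick way to see whether a cast like this is needed at all.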

In [11]:
combined_df1 = combined_df.merge(subred_cluster, how = 'outer', on = 'subreddit')

In [13]:
combined_df2 = combined_df1.merge(timezone_loc, how = 'outer', on = 'author_synthetic')

In [21]:
combined_df2 = combined_df2.dropna(how='any')
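
To see what the outer merge followed by `dropna(how='any')` does to unmatched rows, the sketch below runs the same pattern on two small made-up tables. It uses the `indicator=True` option of `.merge()` to label where each row came from, and compares the result with an inner merge. The table contents are invented, and the comparison only holds here because the toy data has no other missing values.

```python
import pandas as pd

# Made-up tables: "news" has no cluster and "gaming" has no posts.
posts = pd.DataFrame({
    "subreddit": ["aww", "aww", "python", "news"],
    "title": ["cat", "dog", "pep8", "headline"],
})
clusters = pd.DataFrame({
    "subreddit": ["aww", "python", "gaming"],
    "cluster_name": ["animals", "programming", "games"],
})

# indicator=True adds a _merge column: "both", "left_only" or "right_only".
outer = posts.merge(clusters, how="outer", on="subreddit", indicator=True)
match_counts = outer["_merge"].value_counts()

# Dropping rows with any missing value removes the unmatched rows.
outer_then_dropna = (
    outer.drop(columns="_merge")
    .dropna(how="any")
    .sort_values(["subreddit", "title"])
    .reset_index(drop=True)
)

# For comparison: an inner merge keeps only keys present on both sides.
inner = (
    posts.merge(clusters, how="inner", on="subreddit")
    .sort_values(["subreddit", "title"])
    .reset_index(drop=True)
)
```

On real data the two approaches can differ: `dropna(how='any')` also removes matched rows that have a missing value in some other column, which an inner merge would keep.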

## Export as Parquet File

Now that the merging is complete, you can export your dataframe as a `parquet` file with the `.to_parquet()` method. To load it again in a separate notebook, use the `pd.read_parquet()` function.

In [22]:
combined_df2.to_parquet(DATA_DIR + '/merged_combined_file.parquet')