# Synthetic Data Creation
To protect the privacy of the users whose information appears in the geographical datasets, we created synthetic data for the data manipulation and visualization demos in this repository. To avoid revealing any user's identity, we generated a fake username for every author in our dataset and used that same fake username across all the datasets, so that each author's data stays paired with the same fake username in every dataframe. This file shows how we created this data.

## Import Libraries
We import the libraries needed to run the code in this file. If you get a warning that certain packages do not meet the currently supported version or are incompatible, you can run `pip3 install --upgrade requests` in your terminal, which upgrades the `requests` package to its latest version.

In [2]:
import numpy as np
import pandas as pd
import os
import glob

## Preparing Data for Conversion
Before we start working with the usernames in the location dataset, we need to prepare our geographical (Country/City) data with proper column names and make sure that we're only working with USA data because that is what we use to create our choropleth visualizations.

Using the `.rename()` method, we give every column a name that accurately describes its content. Next, we select only the rows that have "United States" in the "Country" column.

In [3]:
location_df = pd.read_csv('/hpc/group/codeplus22-vis/redditdata/author_location.tsv')
location_df.rename(columns = {'Scarker':'author', 'Canada':'Country', 'Unnamed: 2':'State', 'Unnamed: 3':'City'}, inplace = True)
renamed_df = location_df[location_df['Country'] == 'United States']
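
As a minimal sketch of the rename-and-filter step, the snippet below uses a small made-up frame. The rows and the names `raw`, `renamed`, and `usa_only` are illustrative and do not come from the real file. The trailing `.copy()` is an addition that is not in the cell above; it gives the filtered result its own data, so later column assignments do not act on a slice of the original frame.

```python
import pandas as pd

# Hypothetical stand-in for author_location.tsv; all values are invented.
raw = pd.DataFrame({
    'Scarker': ['alice', 'bob', 'carol'],
    'Canada': ['United States', 'Canada', 'United States'],
    'Unnamed: 2': ['Oregon', 'Ontario', 'Texas'],
    'Unnamed: 3': ['Multnomah County', 'Toronto', 'Harris County'],
})

# Give every column a descriptive name.
renamed = raw.rename(columns={'Scarker': 'author', 'Canada': 'Country',
                              'Unnamed: 2': 'State', 'Unnamed: 3': 'City'})

# Keep only USA rows; .copy() (an assumption added here) detaches the result.
usa_only = renamed[renamed['Country'] == 'United States'].copy()
print(usa_only)
```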

## Creating and Adjusting Fake Usernames
Once the data has been filtered, we can create the fake usernames. To do this, we run a for loop over every row in the dataset and fill a new column titled 'author_synthetic' with the value 'user_x', where x is a running counter that starts at 1 and increases by one for each row.

In [4]:
# Running counter used to build the fake usernames (user_1, user_2, ...)
number = 0
for index, row in renamed_df.iterrows():
    number += 1
    author = 'user_' + str(number)
    # Write the fake username into the new column for this row
    renamed_df.loc[index, "author_synthetic"] = author

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  rslt_df.loc[index, "author_synthetic"] = author
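
The message above is pandas' `SettingWithCopyWarning`, which appears when a column is assigned on a frame that was itself produced by filtering another frame. One way to avoid it, sketched here on invented data, is to take an explicit `.copy()` of the filtered frame and then build the usernames in a single assignment rather than row by row. The names `source` and `usa` are hypothetical, and this is an alternative to the loop above, not the code we ran.

```python
import pandas as pd

# Invented data standing in for the location frame.
source = pd.DataFrame({
    'author': ['alice', 'bob', 'carol', 'dave'],
    'Country': ['United States', 'Canada', 'United States', 'United States'],
})

# .copy() makes the filtered frame independent of `source`.
usa = source[source['Country'] == 'United States'].copy()

# Running counter starting at 1, matching the user_1, user_2, ... scheme.
usa['author_synthetic'] = ['user_' + str(n) for n in range(1, len(usa) + 1)]
print(usa)
```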


Once the 'author_synthetic' column was created, we moved it to the front so it could be compared with the 'author' column more easily, and we dropped every row containing a `NaN` value.

In [5]:
last_column = renamed_df['author_synthetic']
renamed_df.drop(labels=['author_synthetic'], axis=1,inplace = True)
renamed_df.insert(0, 'author_synthetic', last_column)
renamed_df = renamed_df.dropna()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  rslt_df.drop(labels=['author_synthetic'], axis=1,inplace = True)
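
The column move above can also be written more compactly. The sketch below, on an invented frame named `frame`, uses `pop()` to remove the column and return it, then `insert()` to place it at position 0. It is meant as an equivalent illustration of the same step, not as the code we ran.

```python
import pandas as pd

# Invented frame with the synthetic column in last position.
frame = pd.DataFrame({
    'author': ['alice', 'bob'],
    'Country': ['United States', 'United States'],
    'author_synthetic': ['user_1', 'user_2'],
})

# pop() removes the column and returns it; insert() puts it at position 0.
frame.insert(0, 'author_synthetic', frame.pop('author_synthetic'))
print(list(frame.columns))
```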


Next, we took a random sample of 100 rows from the dataframe using the `.sample(n=100)` method, so that the authors we used were chosen at random. We do not drop the 'author' column yet because we still need it to merge the data frames.

In [6]:
renamed_df_sample = renamed_df.sample(n = 100)
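
Because the call above does not fix a seed, each run draws a different set of 100 authors. If a repeatable sample is wanted, `sample()` accepts a `random_state` argument. The sketch below shows this on an invented frame; the seed value `0` and the names `pool`, `first`, and `second` are assumptions for illustration only.

```python
import pandas as pd

# Invented pool of 1000 fake usernames.
pool = pd.DataFrame({'author_synthetic': ['user_' + str(n) for n in range(1, 1001)]})

# random_state is an assumption added here; the original call does not set one.
first = pool.sample(n=100, random_state=0)
second = pool.sample(n=100, random_state=0)

# Same seed, same rows in the same order.
print(first.equals(second))
```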

Next, we load the dataframe with author, latitude, longitude, and time information. Since the location data frame and this one share the 'author' column, we merge them on that column using `pd.merge()`. The result is a dataframe that includes the information from both datasets, which means that the fake usernames we created for the authors in the previous data frame are now paired with the same authors in this data frame as well. After doing this, we drop the rows with NaN values.

In [7]:
latlong_df = pd.read_csv('/hpc/group/codeplus22-vis/redditdata/geo_known_tz.tsv', sep='\t')
merged_data = pd.merge(renamed_df_sample, latlong_df, on = 'author')
merged_data = merged_data.dropna()
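
To illustrate how the merge carries the fake usernames over, here is a small sketch with invented data. `pd.merge()` performs an inner join by default, so only authors that appear in both frames are kept. The names `sample`, `geo`, and `paired` are hypothetical.

```python
import pandas as pd

# Invented sample with real and fake usernames side by side.
sample = pd.DataFrame({
    'author_synthetic': ['user_1', 'user_2', 'user_3'],
    'author': ['alice', 'bob', 'carol'],
})

# Invented coordinates; 'erin' has no match in `sample`, 'bob' has none here.
geo = pd.DataFrame({
    'author': ['alice', 'carol', 'erin'],
    'lat': [45.5, 29.8, 40.7],
    'long': [-122.7, -95.4, -74.0],
})

# Default how='inner': only authors present in both frames survive.
paired = pd.merge(sample, geo, on='author')
print(paired)
```

Authors without a match on either side are dropped by the inner join, which is one reason a merged frame can end up with fewer rows than the sample it started from.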

Finally, we extracted the columns of this data frame that were included in the original dataset, replacing the 'author' column with the 'author_synthetic' column. This way, the data in both data frames is paired with the correct fake username, without revealing the real username of the person who shared their information.

In [8]:
author_location_synthetic = merged_data[['author_synthetic', 'Country', 'State', 'City']].copy()
author_location_synthetic

Unnamed: 0,author_synthetic,Country,State,City
0,user_43471,United States,Oregon,Multnomah County
1,user_38192,United States,Washington,King County
2,user_10589,United States,Colorado,Denver County
3,user_26046,United States,Texas,Harris County
4,user_2936,United States,Washington,Walla Walla County
...,...,...,...,...
95,user_43406,United States,Washington,King County
96,user_20428,United States,California,San Francisco County
97,user_4282,United States,Utah,Iron County
98,user_20721,United States,Kansas,Douglas County


In [9]:
geo_known_synthetic = merged_data[['author_synthetic', 'created_utc', 'long', 'lat', 'timezone']].copy()
geo_known_synthetic

Unnamed: 0,author_synthetic,created_utc,long,lat,timezone
0,user_43471,1580159877,-122.675026,45.505106,America/Los_Angeles
1,user_38192,1549478880,-122.332071,47.606209,America/Los_Angeles
2,user_10589,1348029321,-104.990251,39.739236,America/Denver
3,user_26046,1450025129,-95.369803,29.760427,America/Chicago
4,user_2936,1331525185,-118.343021,46.064581,America/Los_Angeles
...,...,...,...,...,...
95,user_43406,1579544282,-122.332071,47.606209,America/Los_Angeles
96,user_20428,1416259883,-122.419415,37.774929,America/Los_Angeles
97,user_4282,1331556979,-113.061893,37.677477,America/Denver
98,user_20721,1417659035,-95.235250,38.971669,America/Chicago


We read the third dataset in as a csv file and dropped duplicate entries (using the 'id' column) so that we worked with a wide variety of data. Next, we shifted the columns so that each one was titled with a name that matched the information it contained. Finally, we dropped an 'index' column, which was irrelevant for our visualizations, so that this data frame could load faster in future steps.

In [None]:
post_info_df = pd.read_csv('/hpc/group/codeplus22-vis/readonlyredditdata/redditdata/combined_file.csv')
post_info_df = post_info_df.drop_duplicates(subset='id')
# Move the index into a regular column, then shift every value one column to the right
combined_df = post_info_df.reset_index().shift(1, axis=1)
# Restore the original index values in the second column
combined_df.iloc[:,1] = post_info_df.reset_index().values[:,0]
# Drop the now-empty 'index' column
combined_df.drop('index', inplace=True, axis=1)
combined_df = combined_df.drop_duplicates(subset='id')
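
The `drop_duplicates(subset='id')` call above keeps the first row for each distinct 'id' value and discards later repeats. The sketch below shows that behavior on invented data. Note that rows are compared on 'id' only, so in this toy example an author with two different ids still appears twice. The names `posts` and `unique_posts` are hypothetical.

```python
import pandas as pd

# Invented posts; the second row repeats the id of the first.
posts = pd.DataFrame({
    'id': ['a1', 'a1', 'b2', 'c3'],
    'author': ['alice', 'alice', 'bob', 'alice'],
    'subreddit': ['pics', 'pics', 'aww', 'funny'],
})

# Keep the first occurrence of each id.
unique_posts = posts.drop_duplicates(subset='id')
print(unique_posts)
```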



We then merged this dataframe with the merged dataframe we created previously, using the 'author' column. This ensured that each fake username would once again be paired with the correct author's information.

In [14]:
real_final_data = pd.merge(merged_data, combined_df, on = 'author')
real_final_data = real_final_data.dropna()

Finally, we once again selected only the columns that were useful for our visualizations and that were included in the original dataset. However, we replaced the 'author' column with the 'author_synthetic' column to protect the users' privacy.

In [15]:
combined_file_synthetic = real_final_data[['author_synthetic', 'created_utc', 'subreddit']].copy()
combined_file_synthetic

Unnamed: 0,author_synthetic,created_utc,subreddit
0,user_43471,1580159877,Portland
1,user_38192,1549478880,bayarea
2,user_10589,1348029321,reddit.com
3,user_26046,1450025129,trees
4,user_2936,1331525185,pics
...,...,...,...
95,user_43406,1579544282,AmateurRoomPorn
96,user_20428,1416259883,movie_scores
97,user_4282,1331556979,AskReddit
98,user_20721,1417659035,nosleep


Since one of our first data wrangling demos shows how we originally merged all these files, we separated the data frame into 4 datasets of 25 rows so that this synthetic data can be used in the merging demo.

In [16]:
rows0_24 = combined_file_synthetic.iloc[0:25]
rows0_24

Unnamed: 0,author_synthetic,created_utc,subreddit
0,user_43471,1580159877,Portland
1,user_38192,1549478880,bayarea
2,user_10589,1348029321,reddit.com
3,user_26046,1450025129,trees
4,user_2936,1331525185,pics
5,user_42184,1573064210,AmItheAsshole
6,user_36122,1537446445,reddit.com
7,user_7328,1348026546,UMF
8,user_37932,1547821804,funny
9,user_19436,1407125893,todayilearned


In [17]:
rows25_49 = combined_file_synthetic.iloc[25:50]
rows25_49

Unnamed: 0,author_synthetic,created_utc,subreddit
25,user_23101,1426998618,funny
26,user_35582,1531837592,office
27,user_9095,1348027833,AdviceAnimals
28,user_27748,1462273431,blackops3
29,user_3551,1331533045,aww
30,user_33364,1512249325,childfree
31,user_28835,1474400083,askcarsales
32,user_14644,1357402950,firstworldproblems
33,user_41823,1570224358,boulder
34,user_9620,1348028300,reddit.com


In [18]:
rows50_74 = combined_file_synthetic.iloc[50:75]
rows50_74

Unnamed: 0,author_synthetic,created_utc,subreddit
50,user_32155,1504111903,IWatchedAnOldSeries
51,user_36321,1538450308,breastfeeding
52,user_29586,1480812660,MotoLA
53,user_36164,1537687637,Marriage
54,user_22636,1425491381,ilstu
55,user_13974,1348067346,AskReddit
56,user_11274,1348030248,TheLastAirbender
57,user_17767,1390168772,C25K
58,user_31608,1499694872,fffffffuuuuuuuuuuuu
59,user_26148,1450375114,3DSdeals


In [19]:
rows75_99 = combined_file_synthetic.iloc[75:100]
rows75_99

Unnamed: 0,author_synthetic,created_utc,subreddit
75,user_31961,1503074428,comics
76,user_35952,1535790145,carporn
77,user_9860,1348028532,AskReddit
78,user_7757,1348026847,fffffffuuuuuuuuuuuu
79,user_5770,1344582208,entertainment
80,user_16087,1375225094,aww
81,user_38312,1550382653,beardsgonecuddly
82,user_322,1257808247,AskReddit
83,user_40680,1563504603,teefies
84,user_8670,1348027461,firstworldproblems
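
The four slices above could also be produced in a loop. The following is a minimal sketch on an invented 100-row frame; `frame`, `chunk_size`, and `chunks` are hypothetical names, and the cells above remain the code we actually ran.

```python
import pandas as pd

# Hypothetical 100-row frame standing in for combined_file_synthetic.
frame = pd.DataFrame({'author_synthetic': ['user_' + str(n) for n in range(1, 101)]})

chunk_size = 25

# Slice the frame into consecutive blocks of chunk_size rows.
chunks = [frame.iloc[start:start + chunk_size]
          for start in range(0, len(frame), chunk_size)]
print([len(chunk) for chunk in chunks])
```

Concatenating the chunks in order with `pd.concat(chunks)` gives back the original frame, which is what the merging demo relies on.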


## Saving the Data Frames as Parquet Files
To finish this process, we used the `.to_parquet()` method to save these data frames as `.parquet` files. We found that this allows for faster loading time when we load the files into the demos we prepared.

In [111]:
rows0_24.to_parquet('/hpc/group/codeplus22-vis/readonlyredditdata/redditdata/part1_synthetic.parquet')
rows25_49.to_parquet('/hpc/group/codeplus22-vis/readonlyredditdata/redditdata/part2_synthetic.parquet')
rows50_74.to_parquet('/hpc/group/codeplus22-vis/readonlyredditdata/redditdata/part3_synthetic.parquet')
rows75_99.to_parquet('/hpc/group/codeplus22-vis/readonlyredditdata/redditdata/part4_synthetic.parquet')

In [93]:
combined_file_synthetic.to_parquet('/hpc/group/codeplus22-vis/readonlyredditdata/redditdata/combined_file_synthetic.parquet')

In [94]:
geo_known_synthetic.to_parquet('/hpc/group/codeplus22-vis/readonlyredditdata/redditdata/geo_known_synthetic.parquet')