# Read Reddit Data

# Assumptions
We do NOT use data from `reddit.l2.raw` because it is only minimally processed.

We assume every line in the 500K samples comes from a unique user.

# Issues
China is underrepresented.
* Reason: users in China have limited access to Reddit because of the firewall.
* Solution: randomly pick, from each of the other countries, the same number of lines as the Chinese file contains.
* Consequence: we lose information for the other countries.
* Benefit: we have a smaller dataset to deal with.
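
The downsampling idea above can be sketched on a toy dataframe. This is a minimal illustration, assuming pandas 1.1 or later (for `groupby(...).sample`). The helper name `downsample_to_smallest`, the `seed` parameter, and the toy data are hypothetical and are not part of this notebook's pipeline.

```python
import pandas as pd


def downsample_to_smallest(df, label_col="native_language", seed=0):
    """Keep the same number of rows per label, equal to the smallest group size."""
    smallest = df[label_col].value_counts().min()
    # Draw `smallest` rows from every group, without replacement.
    balanced = df.groupby(label_col).sample(n=smallest, random_state=seed)
    return balanced.reset_index(drop=True)


# Toy data (hypothetical): 2 "Chinese" comments and 5 "French" comments.
toy = pd.DataFrame({
    "comments": [f"comment {i}" for i in range(7)],
    "native_language": ["Chinese"] * 2 + ["French"] * 5,
})
balanced = downsample_to_smallest(toy)
print(balanced["native_language"].value_counts())
```

The trade-off is the one listed above: the larger groups are cut down to the size of the smallest one, so some of their rows are discarded.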

In [23]:
import pandas as pd
import random

In [25]:
def read_file(country, native_language, limit=-1):
    """
    Read one file given the country name.
    When `limit` (number of lines to choose) is positive, choose `limit` comments at random.

    @param country: country name
    @param native_language: language used in that country
    @param limit: number of lines to randomly choose; if limit <= 0, we keep all the lines
    @return a dataframe of "comments" and "native_language" where "native_language" is set to one value
    """
    path = f'../raw_data/reddit.l2/reddit.l2.clean.500K/reddit.{country}.txt.tok.clean.shf.500K.nometa.tc.noent.fw.url.lc'
    # The context manager closes the file even if reading fails.
    with open(path, "r") as file:
        lines = [l.rstrip() for l in file]
    if limit > 0:
        # random.sample raises ValueError when asked for more items than exist, so cap the request.
        lines = random.sample(lines, min(limit, len(lines)))
    df = pd.DataFrame({"comments": lines, "native_language": native_language})
    print(f"{country}: {len(lines)} lines")
    return df
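
`random.sample` draws a different subset on every run, so the sampled dataset is not reproducible unless a seed is fixed. Below is a minimal sketch of a seeded variant of the sampling step. The helper `sample_lines` and its `seed` parameter are hypothetical additions, not part of the notebook.

```python
import random


def sample_lines(lines, limit=-1, seed=None):
    """Return `limit` randomly chosen lines, or all lines if limit <= 0 or limit >= len(lines)."""
    if limit <= 0 or limit >= len(lines):
        return list(lines)
    # A local generator keeps the global `random` state untouched.
    rng = random.Random(seed)
    return rng.sample(lines, limit)


lines = [f"line {i}" for i in range(10)]
first = sample_lines(lines, limit=3, seed=42)
second = sample_lines(lines, limit=3, seed=42)
print(first == second)  # the same seed gives the same subset
```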

In [29]:
def read_countries(countries):
    """
    Read the files of the selected countries.
    `limit` is set to 125073 because that is the total number of lines in the Chinese file.

    @param countries: a dictionary where key = country name and value = language of that country
    @return a dataframe of "comments" and "native_language" of all comments of the given countries
    """
    # Build all per-country frames first, then concatenate once.
    frames = [read_file(country, language, limit=125073) for country, language in countries.items()]
    return pd.concat(frames, ignore_index=True)

In [27]:
"""
Get dataframe of comments of selected countries.
"""
countries = {
    "China": "Chinese",
    "France": "French",
    "Germany": "German",
    "Greece": "Greek",
    "Portugal": "Portuguese",
    "Spain": "Spanish"
}
df = read_countries(countries)
df

China: 125073 lines
France: 125073 lines
Germany: 125073 lines
Greece: 125073 lines
Portugal: 125073 lines
Spain: 125073 lines


| | comments | native_language |
|---|---|---|
| 0 | she was born , lived and died on these soils w... | Chinese |
| 1 | her books are a waste of time . | Chinese |
| 2 | here 's a previous thread where i posted the s... | Chinese |
| 3 | NORP are from GPE which is not in LOC | Chinese |
| 4 | GPE tries to heavily promote uniquely NORP fac... | Chinese |
| ... | ... | ... |
| 750433 | > not according to the project website , libre... | Spanish |
| 750434 | did you see that all that has nothing to do wi... | Spanish |
| 750435 | then what has them saying it CARDINAL times to... | Spanish |
| 750436 | which i do n't have access to and i 'd have to... | Spanish |
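
With 6 countries and 125073 lines each, the combined dataframe should hold 6 × 125073 = 750438 rows. A check along these lines could confirm that every language contributes the same number of rows. It is only a sketch on toy data, and the helper name `is_balanced` is hypothetical.

```python
import pandas as pd


def is_balanced(df, expected_per_label, label_col="native_language"):
    """Return True if every label occurs exactly `expected_per_label` times."""
    counts = df[label_col].value_counts()
    return bool((counts == expected_per_label).all())


# Expected size of the full dataset described above.
expected_rows = 6 * 125073

# Toy data (hypothetical): 3 rows per language.
toy = pd.DataFrame({
    "comments": [f"comment {i}" for i in range(6)],
    "native_language": ["Chinese"] * 3 + ["Spanish"] * 3,
})
print(expected_rows, is_balanced(toy, 3))
```

On the real data the call would be `is_balanced(df, 125073)`.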


In [28]:
"""
Export dataframe to CSV file.
"""
df.to_csv("data/reddit_6languages.csv", index=False)
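
One thing to keep in mind when the CSV is read back: by default `pd.read_csv` converts strings such as `null` or `n/a` to `NaN`, so a comment consisting only of such a token would be lost. Whether the real data contains such comments is not checked here. The sketch below uses hypothetical toy data and an in-memory buffer instead of the exported file, and shows how `keep_default_na=False` keeps those comments as text.

```python
import io

import pandas as pd

# Toy data (hypothetical): two comments that look like missing-value markers.
toy = pd.DataFrame({
    "comments": ["her books are a waste of time .", "null", "n/a"],
    "native_language": ["Chinese", "French", "German"],
})
csv_text = toy.to_csv(index=False)  # returns the CSV as a string when no path is given

# Default parsing turns "null" and "n/a" into NaN.
default_read = pd.read_csv(io.StringIO(csv_text))
# Disabling the default NA markers keeps every comment as a string.
strict_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["comments"].isna().sum())
print(strict_read["comments"].tolist())
```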