# Title

Introduction section

Mention the algorithms and technologies used. We will use latent Dirichlet allocation (LDA), word2vec (w2v), and others.

## Data engineering

How we collected this data, the command used, and how the pieces were combined.

### Scraping

This data was collected using [Twarc](https://github.com/DocNow/twarc), a command-line tool for scraping data from Twitter. We used the following bash command:

```sh
twarc --recursive search '"transgender" OR "trans person" OR "trans people" OR "transmasc" OR "transfem" OR "trans man" OR "trans woman" OR "trans boy" OR "trans girl" OR "trans men" OR "trans women" OR "enby" OR "non binary"' | tee /dev/tty | gzip --stdout > "$OUTFILE"
```

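This command does the following:
1. Searches for every tweet from roughly the past week (the window covered by Twitter's search API) that contains any of the specified terms.
2. Prints each result to the terminal (`tee /dev/tty`) so that progress can be monitored.
3. Compresses the data with gzip to save disk space.
4. Saves the compressed data to the path stored in `$OUTFILE`.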
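
To check that a collection run produced usable output, the gzipped file can be read back line by line. The following is a minimal sketch using only the standard library. It assumes the file holds one JSON object per line, and `count_tweets` is a hypothetical helper name, not part of twarc.

```python
import gzip
import json


def count_tweets(path):
    """Count the JSON records in a gzipped, line-delimited JSON file."""
    count = 0
    # 'rt' decompresses and decodes to text in one step.
    with gzip.open(path, 'rt', encoding='utf-8') as file:
        for line in file:
            if line.strip():      # skip blank lines, if any
                json.loads(line)  # raises ValueError on a corrupt record
                count += 1
    return count
```

Calling `count_tweets` on the path that was stored in `$OUTFILE` should return the number of collected tweets, or raise an error if the file was truncated mid-write.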

### Combining the data

Now, we will combine all the data into a single dataframe. 

In [3]:
import pickle
import sys
import os

import gzip
from datetime import datetime

import jsonlines
import pandas as pd

def autopickle(path):
    """
    Decorator that caches a function's result as a gzipped pickle.

    If a file exists at `path`, load and return that gzipped pickle object.
    Otherwise, run the function, pickle and gzip the result, and return the result.

    Note: the function runs at decoration time, so the decorated name is
    bound to the result itself, not to a callable.
    """
    def decorator(func):
        if os.path.exists(path):
            with gzip.open(path, 'rb') as file:
                model = pickle.load(file)
        else:
            model = func()
            with gzip.open(path, 'wb') as file:
                pickle.dump(model, file)
        return model
    return decorator


def parse_twitter_datetime(dt: str):
    return datetime.strptime(dt, '%a %b %d %H:%M:%S +0000 %Y')


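def read_jsonl_gz(path):
    """Read a gzipped JSON Lines file of tweets into a DataFrame."""
    # Open in text mode and manage the gzip handle explicitly so it is closed.
    with gzip.open(path, 'rt', encoding='utf-8') as file, jsonlines.Reader(file) as reader:
        raw_tweets = list(reader)

    tweet_df = pd.DataFrame(data={
        'tweet': [t['full_text'] for t in raw_tweets],
        'author': [t['user']['screen_name'] for t in raw_tweets],
        'date': [parse_twitter_datetime(t['created_at']) for t in raw_tweets],
        'id': [t['id'] for t in raw_tweets]
    })
    # 'id' stays a regular column. The original un-assigned set_index('id')
    # call had no effect, and the caller deduplicates on the 'id' column.

    return tweet_df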
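
The format string in `parse_twitter_datetime` matches the literal `+0000` offset and discards it, so the returned `datetime` is naive. The sketch below shows this on a made-up sample timestamp, alongside a hypothetical timezone-aware variant, `parse_twitter_datetime_utc`, which is not part of the notebook and assumes downstream code would prefer explicit UTC.

```python
from datetime import datetime

sample = 'Thu May 06 12:34:56 +0000 2021'  # made-up example timestamp

# Same format string as parse_twitter_datetime: the literal '+0000' is
# matched but not interpreted, so the result carries no timezone.
naive = datetime.strptime(sample, '%a %b %d %H:%M:%S +0000 %Y')


def parse_twitter_datetime_utc(dt: str) -> datetime:
    """Hypothetical variant: parse the offset with %z to keep UTC attached."""
    return datetime.strptime(dt, '%a %b %d %H:%M:%S %z %Y')


aware = parse_twitter_datetime_utc(sample)
```

Naive and aware datetimes cannot be compared with each other, so whichever form is chosen should be used consistently across the dataframe.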

In [None]:
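@autopickle('../data/joined_tweets.pickle.gz')
def tweet_df():
    INPUTS = [
        '../data/transgender/2021-05-06_2021-05-13.jsonl.gz',
        '../data/transgender/2021-05-12_2021-05-20.jsonl.gz',
    ]
    # The date ranges in the file names overlap, so duplicates are dropped by id.
    tweet_df = pd.concat([read_jsonl_gz(path) for path in INPUTS], ignore_index=True)
    tweet_df.drop_duplicates('id', inplace=True)
    tweet_df.to_pickle('../data/agg_trans_tweets.pickle.gz', compression='gzip')
    # Return the frame so that autopickle caches it and binds it to `tweet_df`.
    return tweet_df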
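
A decorator written like `autopickle` runs the function once, at decoration time, and binds the name to the result. After the cell above, `tweet_df` is therefore meant to be a dataframe rather than a callable. The sketch below illustrates the same pattern using only the standard library; `cached_result`, `expensive_value`, and the temporary cache path are hypothetical names chosen for this example.

```python
import gzip
import os
import pickle
import tempfile


def cached_result(path):
    """Hypothetical stand-in that follows the same pattern as autopickle."""
    def decorator(func):
        if os.path.exists(path):
            # Cache hit: load the stored object and skip the computation.
            with gzip.open(path, 'rb') as file:
                return pickle.load(file)
        result = func()
        with gzip.open(path, 'wb') as file:
            pickle.dump(result, file)
        return result
    return decorator


cache_path = os.path.join(tempfile.mkdtemp(), 'value.pickle.gz')
calls = []


@cached_result(cache_path)
def expensive_value():
    calls.append('first run')
    return {'answer': 42}


# The name now refers to the dict itself, so it is used without parentheses.
print(expensive_value['answer'])


@cached_result(cache_path)
def expensive_value_again():
    calls.append('second run')  # never executed: the cache file exists
    return {'answer': 0}
```

Because the cache is keyed only on the file path, changing the function body does not invalidate it. To force a recomputation, delete the cached file first.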
    
    

## Feature engineering

TF-IDF, cleaning, tokenization, pre-processing


## Analysis, Model Training

N-grams

LDA training + results

## Conclusion

reflection

next steps

