## Covid Tweet Misinformation Prediction Competition
Let's share our models on a centralized leaderboard so that we can collaborate and learn from each other's model experiments.

**Instructions:**
1. Get data in and set up X_train / X_test / y_train
2. Write preprocessor function and preprocess data
3. Fit model on preprocessed data and save preprocessor function and model
4. Generate predictions from X_test data and submit model to competition
5. Repeat submission process to improve place on leaderboard

## 1. Get data in and set up X_train, X_test, y_train objects

In [1]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/covid_tweet_competition_data-repository:latest')


Data downloaded successfully.


In [2]:
# Separate data into X_train, y_train, and X_test
import pandas as pd
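# Each file holds a single column, so squeeze("columns") returns a Series.
# (The squeeze= argument of read_csv was removed in pandas 2.0.)
X_train = pd.read_csv("covid_tweet_competition_data/X_train.csv").squeeze("columns")
X_test = pd.read_csv("covid_tweet_competition_data/X_test.csv").squeeze("columns")

y_train_labels = pd.read_csv("covid_tweet_competition_data/y_train_labels.csv").squeeze("columns")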

X_train.head()

0    "[T]he label of the popular Lysol already show...
1    There were more deaths on the roads of France ...
2    250 new cases of #COVID19Nigeria; Plateau-69 F...
3    @XanderArmstrong Why was chloroquine described...
4    Our new Can Compare 'College' tags make it eas...
Name: tweet, dtype: object
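
The loading cell relies on each single-column CSV coming back as a pandas Series rather than a DataFrame. The sketch below illustrates that behavior on an in-memory CSV; the two example tweets are made up for illustration and are not part of the competition data.

```python
import io

import pandas as pd

# Hypothetical single-column CSV standing in for X_train.csv
csv_text = "tweet\nfirst example tweet\nsecond example tweet\n"

frame = pd.read_csv(io.StringIO(csv_text))   # a DataFrame with one column
series = frame.squeeze("columns")            # collapsed to a Series

print(type(frame).__name__, type(series).__name__)
print(series.name)
```

A Series of raw strings is the input shape that scikit-learn's text vectorizers expect, which is why the squeeze step matters here.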

## 2. Write preprocessor function and preprocess data

In [None]:
# Build a document-term matrix (DTM) from the words in the training set.
# Drop English stop words, which occur too frequently to be useful, and
# weight the remaining terms with TF-IDF (term frequency-inverse document frequency),
# which down-weights words that appear in many documents.

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_simple = tf_idf_vectorizer.fit(X_train)
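
To see what the fitted vectorizer produces, here is a minimal sketch on a toy corpus. The three tweets and the names `toy_tweets`, `toy_vectorizer`, and `toy_dtm` are assumptions for illustration only; the real notebook fits on `X_train`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for X_train
toy_tweets = [
    "masks reduce the spread of the virus",
    "the virus is a hoax",
    "new cases reported in the city",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_dtm = toy_vectorizer.fit_transform(toy_tweets)  # sparse document-term matrix

# One row per tweet, one column per vocabulary term kept after stop-word removal
print(toy_dtm.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

Stop words such as "the" never enter the vocabulary, so they contribute no columns to the matrix.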

In [None]:
# Write function to transform data with preprocessor
# New samples will be put into a DTM based on the vocabulary from the training set

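def preprocessor(data):
    from sklearn.feature_extraction.text import TfidfVectorizer
    # Transform with the vectorizer fitted on the training set (returns a sparse matrix)
    new_tfidf_df = tfidf_simple.transform(data)
    # Convert to a dense NumPy array; todense() returns np.matrix,
    # which recent scikit-learn versions reject as model input
    new_tfidf_df = new_tfidf_df.toarray()
    return new_tfidf_df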

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)
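
Both printed shapes should share the same number of columns, because new samples are mapped onto the vocabulary learned from the training set. The sketch below shows this on made-up documents: words never seen during fitting are ignored. The names `train_docs`, `new_docs`, and `vec` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents for illustration only
train_docs = ["vaccine trial results published", "vaccine rumors spread online"]
new_docs = ["vaccine microchip conspiracy"]

vec = TfidfVectorizer(stop_words="english").fit(train_docs)

train_matrix = vec.transform(train_docs).toarray()
new_matrix = vec.transform(new_docs).toarray()

# Same column count: the vocabulary is fixed at fit time
print(train_matrix.shape, new_matrix.shape)

# "microchip" was never seen during fitting, so it has no column
print("microchip" in vec.vocabulary_)

# Only the known word "vaccine" yields a nonzero entry for the new document
print((new_matrix != 0).sum())
```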

## 3. Fit model on preprocessed data and save preprocessor function and model

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(preprocessor(X_train), y_train_labels) # Fit on the training set.
model.score(preprocessor(X_train), y_train_labels) # Accuracy on the training set (0-1 scale); optimistic, since the model has already seen this data.
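
The score above is computed on the same data the model was fitted on, so it tends to overstate how well the model will do on the leaderboard. One common way to get a more honest estimate before submitting is to hold out part of the training data. The sketch below is one possible approach, not the competition's required workflow; the toy tweets and the label names `"real"` and `"fake"` are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy tweets and labels; the label names are assumptions
texts = [
    "vaccine trial results published by health agency",
    "official case counts updated for the region",
    "hospital reports new recovery statistics",
    "health ministry confirms testing numbers",
    "drinking bleach cures the virus overnight",
    "miracle herb cures infection instantly",
    "secret cure hidden by doctors revealed",
    "virus is a hoax spread by the media",
]
labels = ["real"] * 4 + ["fake"] * 4

# Hold out 25% of the samples, keeping the class balance in both splits
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vectorizer on the training split only, so the validation
# split plays the role of unseen data
vectorizer = TfidfVectorizer(stop_words="english").fit(train_texts)

clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(vectorizer.transform(train_texts), train_labels)

train_score = clf.score(vectorizer.transform(train_texts), train_labels)
val_score = clf.score(vectorizer.transform(val_texts), val_labels)
print("train accuracy:", train_score)
print("validation accuracy:", val_score)
```

With the real data, the same split could be applied to `X_train` and `y_train_labels`. On a corpus this small the numbers are not meaningful; the point is only the pattern of scoring on data the model has not seen.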