# Classifying subreddit posts from the UEFA Champions League and English Premier League

## Problem Statement

EuroFootball is a monthly football magazine covering both the major European leagues (Premier League, Serie A, La Liga, etc.) and European competitions such as the UEFA Champions League and Europa League.

One of the columns in EuroFootball publishes readers' comments on any topic relating to the teams they support. These comments usually revolve around players' performances, celebrations of important wins and criticism of the manager's team selection.

Fans submit their comments by email, and the editor reviews them and selects the best ones for publication. However, this is an extremely time-consuming process, because the submitted comments need to be sorted into topics before the editor's review. In particular, comments about the English Premier League and the UEFA Champions League form the majority.

The editorial team at EuroFootball has therefore tasked its data analytics team with creating a model that can predict whether a comment submitted by a reader is about the English Premier League or the UEFA Champions League.

To predict and categorise the comments correctly, the model needs to be trained on labeled data. The data analytics team therefore proposes to scrape posts from the r/championsleague and r/PremierLeague subreddits on Reddit to use as labeled data.

**About Reddit**

Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members. Posts are organized by subject into user-created boards called "communities" or "subreddits", which cover a variety of topics such as news, politics, religion, science, movies, video games, music, books, sports, fitness, cooking, pets, and image-sharing. 



## Objective

The main objectives of the project are as follows:

1. Create a comment classifier model that can categorize readers' comments as being about either the UEFA Champions League or the English Premier League
2. Identify the most important words that distinguish comments about the UEFA Champions League from those about the English Premier League
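
To make the second objective concrete, the sketch below ranks words by how much more often they appear in one class than in the other, using a smoothed log frequency ratio on a handful of made-up comments. It is only an illustration of the idea and not the method used in this project; the example comments, the function names `word_counts` and `distinguishing_words`, and the add-one smoothing are all assumptions.

```python
import math
import re
from collections import Counter

# Made-up comments; in the project the label is the subreddit a post came from.
cl_comments = ['What a knockout tie, the away leg decided it',
               'Group stage draw looks tough for the holders']
epl_comments = ['Relegation battle is tight this season',
                'The title race will go to the final matchday of the season']


def word_counts(comments):
    """Lower-case each comment, split it into words and count them."""
    return Counter(w for c in comments for w in re.findall(r'[a-z]+', c.lower()))


def distinguishing_words(counts_a, counts_b, top=3):
    """Rank words by the add-one smoothed log frequency ratio of class A to class B."""
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    scores = {w: math.log((counts_a[w] + 1) / total_a)
                 - math.log((counts_b[w] + 1) / total_b)
              for w in vocab}
    return sorted(scores, key=scores.get, reverse=True)[:top]


cl_counts, epl_counts = word_counts(cl_comments), word_counts(epl_comments)
print('Champions League:', distinguishing_words(cl_counts, epl_counts))
print('Premier League:', distinguishing_words(epl_counts, cl_counts))
```

Words shared by both classes, such as "the", score close to zero, while words seen in only one class rise to the top of that class's list. With so few comments many words tie, so the exact order of the printed lists is not meaningful.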

## Data Scraping

We will first scrape the posts from Reddit using [*Pushshift's*](https://github.com/pushshift/api) API.
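
As a rough illustration of what such a request looks like, the sketch below builds the query URL for the submission search endpoint used later in this notebook. The parameter names (`subreddit`, `size`, `before`) are the ones passed by the scraping function further down; the helper name `build_pushshift_url` and the example timestamp are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Endpoint used by the scraping function in this notebook.
BASE_URL = 'https://api.pushshift.io/reddit/search/submission'


def build_pushshift_url(subreddit, size=100, before=None):
    """Return the full query URL for one page of submissions.

    Hypothetical helper for illustration only; it does not send a request.
    """
    params = {'subreddit': subreddit, 'size': size}
    if before is not None:
        # Unix timestamp: only ask for posts created before this moment.
        params['before'] = before
    return f'{BASE_URL}?{urlencode(params)}'


first_page = build_pushshift_url('PremierLeague')
next_page = build_pushshift_url('PremierLeague', before=1650000000)  # made-up timestamp
print(first_page)
print(next_page)
```

Passing the earliest timestamp of one page as `before` in the next request is what lets the scraper walk backwards through a subreddit's history one page at a time.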

In [2]:
import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords, wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import pos_tag

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier


pd.options.display.max_colwidth = 400

### Custom Function

In [3]:
import time

def get_reddit_post(subreddit, post):
    '''Scrape submissions from a subreddit through the Pushshift API.
    
    subreddit: name of the subreddit as a string
    post: minimum number of posts whose selftext is neither empty nor '[removed]'
    
    Returns a DataFrame of every scraped post, including those with empty or removed selftext.'''
    
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100
    }
    res = requests.get(url, params) # scrape the first 100 posts
    print(res.status_code)
    df = pd.DataFrame(res.json()['data'])
    # earliest timestamp in the first batch, used as the 'before' cursor in the loop below
    # (taken from the last row so it also works when fewer than 100 posts are returned)
    earl_date = int(df['created_utc'].iloc[-1])
    
    # number of posts after removing those with empty string or '[removed]' in the selftext column
    num_posts = df.loc[(df['selftext']!='')&(df['selftext']!='[removed]'), :].shape[0]
    
    # keep requesting older posts until 'post' non-empty, non-removed posts have been collected
    while num_posts < post:
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': earl_date
        }
        res = requests.get(url, params)
        print(res.status_code)
        if res.status_code != 200:
            time.sleep(1) # brief pause before retrying a failed request
            continue
        df2 = pd.DataFrame(res.json()['data'])
        if df2.empty:
            break # no older posts are left, so stop instead of looping forever
        earl_date = int(df2['created_utc'].iloc[-1]) # scalar timestamp of the earliest post in this batch
        df = pd.concat([df, df2]) # concatenate the newly scraped dataframe to the current one
        num_posts = df.loc[(df['selftext']!='')&(df['selftext']!='[removed]'), :].shape[0]
        print(f'{num_posts} posts without empty or removed selftext scraped')
    
    return df.reset_index()
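
The pagination logic above can be tried without touching the network. The sketch below replaces the API with a hypothetical in-memory `fake_fetch` function and made-up posts. It uses the same `before` cursor idea and also stops when a page comes back empty, so the loop cannot run forever if the subreddit runs out of posts. All names and data in it are assumptions for illustration.

```python
# Made-up posts, newest first, standing in for what the API would return.
FAKE_POSTS = [
    {'created_utc': 1000 - i, 'selftext': text}
    for i, text in enumerate(['Great win', '', '[removed]', 'Poor lineup',
                              'VAR again', '', 'Title race'])
]


def fake_fetch(before=None, size=3):
    """Hypothetical stand-in for one API call: up to `size` posts older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:size]


def has_text(post):
    """True when the selftext is neither empty nor '[removed]'."""
    return post['selftext'] not in ('', '[removed]')


def collect_posts(target, size=3):
    """Page backwards in time until `target` posts with text are collected."""
    posts, before = [], None
    while sum(has_text(p) for p in posts) < target:
        batch = fake_fetch(before=before, size=size)
        if not batch:
            break  # nothing older is left, so stop instead of looping forever
        posts.extend(batch)
        before = batch[-1]['created_utc']  # earliest timestamp seen so far
    return posts


posts = collect_posts(target=3)
print(len(posts), sum(has_text(p) for p in posts))
```

As in the real function, the returned list still contains the posts with empty or removed text; only the stopping condition counts the usable ones.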

### Scrape the subreddits 

We will scrape the r/PremierLeague and r/championsleague subreddits for 1,000 posts with text each.

In [4]:
df_epl = get_reddit_post('PremierLeague', 1000)

200
200
95 posts without empty or removed selftext scraped
200
145 posts without empty or removed selftext scraped
200
198 posts without empty or removed selftext scraped
200
242 posts without empty or removed selftext scraped
200
285 posts without empty or removed selftext scraped
200
318 posts without empty or removed selftext scraped
200
358 posts without empty or removed selftext scraped
200
402 posts without empty or removed selftext scraped
200
444 posts without empty or removed selftext scraped
200
495 posts without empty or removed selftext scraped
200
534 posts without empty or removed selftext scraped
200
581 posts without empty or removed selftext scraped
200
627 posts without empty or removed selftext scraped
200
662 posts without empty or removed selftext scraped
200
698 posts without empty or removed selftext scraped
200
744 posts without empty or removed selftext scraped
200
795 posts without empty or removed selftext scraped
200
841 posts without empty or removed selfte

In [5]:
df_cl = get_reddit_post('championsleague', 1000)

200
200
61 posts without empty or removed selftext scraped
200
78 posts without empty or removed selftext scraped
200
99 posts without empty or removed selftext scraped
200
123 posts without empty or removed selftext scraped
200
150 posts without empty or removed selftext scraped
200
178 posts without empty or removed selftext scraped
200
204 posts without empty or removed selftext scraped
200
233 posts without empty or removed selftext scraped
200
270 posts without empty or removed selftext scraped
200
308 posts without empty or removed selftext scraped
200
333 posts without empty or removed selftext scraped
200
362 posts without empty or removed selftext scraped
200
384 posts without empty or removed selftext scraped
200
404 posts without empty or removed selftext scraped
200
434 posts without empty or removed selftext scraped
200
460 posts without empty or removed selftext scraped
200
480 posts without empty or removed selftext scraped
200
505 posts without empty or removed selftext

### Inspect the DataFrames

In [6]:
df_epl.head()

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,crosspost_parent,crosspost_parent_list,gallery_data,is_gallery,media_metadata,discussion_type,distinguished,author_cakeday,poll_data,edited
0,0,[],False,mined_it,transparent,,"[{'a': ':liv:', 'e': 'emoji', 'u': 'https://emoji.redditmedia.com/7b4b1cctklg51_t5_2scup/liv'}]",aabec794-dcbc-11ea-8fe1-0e19d6f66ce5,:liv:,dark,...,,,,,,,,,,
1,1,[],False,Tesus4,,,[],,,,...,,,,,,,,,,
2,2,[],False,Tesus4,,,[],,,,...,,,,,,,,,,
3,3,[],False,AutoModerator,,,[],,,,...,,,,,,,,,,
4,4,[],False,Sports_Hat,,,[],,,,...,,,,,,,,,,


In [7]:
df_cl.head()

Unnamed: 0,index,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,...,og_title,gilded,rte_mode,author_id,brand_safe,suggested_sort,approved_at_utc,banned_at_utc,view_count,author_created_utc
0,0,[],False,zaviews,,[],,text,t2_106sw9,False,...,,,,,,,,,,
1,1,[],False,GoalooES,,[],,text,t2_b6693nfb,False,...,,,,,,,,,,
2,2,[],False,Structure-Diligent,,[],,text,t2_7fv9g819,False,...,,,,,,,,,,
3,3,[],False,GoalooES,,[],,text,t2_b6693nfb,False,...,,,,,,,,,,
4,4,[],False,MatchCaster,,[],,text,t2_7h7t33at,False,...,,,,,,,,,,


### Save Scraped Data to CSV Files

The posts scraped from both the 'championsleague' and 'PremierLeague' subreddits are saved to CSV files and processed in the next notebook.

In [8]:
df_epl.to_csv('../data/epl.csv', index=False)

In [9]:
df_cl.to_csv('../data/cl.csv', index=False)