# Finding Common Sentences in <i>The X-Files</i>

Wikipedia has an article for every episode of the television show ['The X-Files'](https://en.wikipedia.org/wiki/The_X-Files). These articles contain lots of copy/pasted information.

I wanted to see which sentences were most common across the articles and what percentage of the articles contain each one.


---

In [40]:
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd
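# Note: plotly 4 and later moved this module to the separate chart_studio package (chart_studio.plotly).
import plotly.plotly as py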
import plotly.graph_objs as go
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

In [2]:
def get_soup(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    return soup

In [4]:
soup = get_soup('https://en.wikipedia.org/wiki/List_of_The_X-Files_episodes')
tables = soup.find_all('table', {'class':'wikitable plainrowheaders wikiepisodetable'})
links = []
for table in tables:
    rows = table.find_all('td', {'class':'summary'})
    for i in rows:
        links.append('https://en.wikipedia.org' + i.find('a')['href'])

# Obtaining and Cleaning the Data

I used requests and Beautiful Soup to get a list of links from the wiki page listing [every X-Files episode](https://en.wikipedia.org/wiki/List_of_The_X-Files_episodes).
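
The cell above builds absolute URLs by concatenating the site root with each `href`. A minimal alternative sketch, assuming the hrefs are site-relative paths, uses `urllib.parse.urljoin` from the standard library and drops duplicate links while keeping their order. The helper name `absolute_links` and the sample hrefs are illustrative, not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en.wikipedia.org'

def absolute_links(hrefs, base=BASE_URL):
    """Turn relative hrefs into absolute URLs, dropping duplicates in order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result

# Sample hrefs for illustration only (not scraped output).
sample_hrefs = [
    '/wiki/Pilot_(The_X-Files)',
    '/wiki/Squeeze_(The_X-Files)',
    '/wiki/Pilot_(The_X-Files)',
]
sample_links = absolute_links(sample_hrefs)
print(sample_links)
```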

I then downloaded every linked page (219 articles) and used a simple cleaning method to isolate the body of each article. The cleaning step removes unwanted formatting characters such as newlines and backslashes, and it removes any period that doesn't mark the end of a sentence (for instance, "Washington D.C." became "Washington DC"). This makes it easier to break the article into sentences.

In [3]:
def get_paragraphs(soup):    
    paragraphs = []
    for i in soup.find_all('p'):
        # keep only the p elements that contain at least one letter
        if re.search('[a-zA-Z]',i.text):
            # remove all backslashes and newlines (raw string so the
            # character class really contains both characters)
            text = re.sub(r'[\\\n]','',i.text)
            # find periods preceded by a single uppercase letter and remove
            # the period only; \1 keeps the letter ("D.C." -> "DC")
            text = re.sub(r'(?<!\w)([A-Z])\.', r'\1', text)
            paragraphs.append(text)
    return paragraphs
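
To make the period-stripping step concrete, here is a small standalone sketch. It assumes the intent described in the text, namely that the capital letter is kept and only the period after it is dropped. The function name `strip_initial_periods` and the sample sentences are made up for illustration.

```python
import re

def strip_initial_periods(text):
    # (?<!\w) requires that the capital letter is not preceded by a word
    # character, so only single-letter initials match; \1 keeps the letter.
    return re.sub(r'(?<!\w)([A-Z])\.', r'\1', text)

cleaned = strip_initial_periods('Filmed near Washington D.C. in 1993.')
print(cleaned)

# "Dr." is untouched because its period follows a lowercase letter.
unchanged = strip_initial_periods('She calls Dr. Scully.')
print(unchanged)
```

One limitation of this pattern: a sentence that genuinely ends in a single capital letter (for example "Plan B. Next") also loses its period, so the two sentences would be merged.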

In [5]:
paras = []
for count,i in enumerate(links):
    paras.append(get_paragraphs(get_soup(i)))
    print(int(count/len(links)*100),end = '\r')
print(len(paras))

219


---
Since the only periods remaining were sentence enders, and Wikipedia doesn't end sentences with "!" or "?", I could split the articles into sentences by finding the periods. 

The only problem was that my regex missed the abbreviation "Dr." because its period is preceded by a lowercase character. Since there were only 10 instances, I decided the simplest solution was to drop the resulting "dr" fragments from the results entirely.

In [75]:
sentences = []
for i in paras:
    for j in i:
        for k in j.split('. '):
            if k != '':
                sentences.append(k.lower())

series = pd.Series(sentences)
series = series[series != 'dr']
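
The splitting step can be tried on a toy paragraph. The sketch below is a variation on the cell above, not the exact code used for the results: it also strips a trailing period, on the assumption that a sentence at the end of a paragraph should match the same sentence in mid-paragraph. The name `split_sentences` and the sample text are illustrative.

```python
def split_sentences(paragraph):
    """Split on '. ' and normalise each piece: lowercase, no trailing period."""
    pieces = []
    for piece in paragraph.split('. '):
        piece = piece.strip().rstrip('.').lower()
        if piece:
            pieces.append(piece)
    return pieces

sample = 'Mulder is a believer. Scully is a skeptic.'
print(split_sentences(sample))

# An abbreviation such as "Dr." still splits the sentence in two.
print(split_sentences('She calls Dr. Scully.'))
```

The second call shows why "Dr." is a problem for this approach: the sentence breaks at the abbreviation, leaving a fragment that ends in "dr".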

# Analysis and Visualization

I used Pandas and Plotly for this portion.

The table shows the top 10 sentences, how many times each appeared, and what percentage of the 219 articles that count represents. The graph shows the same 10 sentences with their absolute frequencies on the y-axis.

In [104]:
df = pd.DataFrame({'Sentence' : series.value_counts().keys(),
                   'Count': series.value_counts().values
                  })
# percentage of the articles, computed on the whole column at once
df['Percentage'] = (df['Count'] / len(paras) * 100).round(2)
trace0 = go.Table(
    columnorder = [1,2,3,4],
    columnwidth = [40,80,80,400],
    header = dict(values = ['','Count', 'Percentage','Sentence'], align = ['center','center','center','center'],
                  font = dict(color = 'white', size = 14),
                  fill = dict(color = ['#f2f2f2','#ec8879']),
                  height = 30),
    cells = dict(values = [[1,2,3,4,5,6,7,8,9,10],df['Count'].head(10),df['Percentage'].head(10),df['Sentence'].head(10)], 
                 align = ['center','center','center', 'left'],
                 fill = dict(color = ['#f2f2f2','white','#f2f2f2','white']),
                 height = 50)
)

py.iplot([trace0], filename='common_sentences_table')

In [84]:
series.value_counts().head(10).iplot(kind='bar', title = 'Most Common Sentences', filename='most_common_sentences_chart')
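
The count and percentage columns can be illustrated on toy data. This sketch swaps in `collections.Counter` for pandas so that it has no dependencies, and it counts each sentence at most once per article, which is an assumption: the cells above count every occurrence. The function name `sentence_shares` and the toy articles are invented for the example.

```python
from collections import Counter

def sentence_shares(articles):
    """Return (sentence, article_count, percentage) rows, most common first.

    `articles` is a list of articles, each a list of sentence strings.
    """
    counts = Counter()
    for sentences in articles:
        counts.update(set(sentences))  # count each sentence once per article
    total = len(articles)
    return [(s, n, round(n / total * 100, 2)) for s, n in counts.most_common()]

# Toy data for illustration only.
toy_articles = [
    ['the show centers on two agents', 'mulder investigates'],
    ['the show centers on two agents', 'scully writes a report'],
    ['the show centers on two agents', 'the show centers on two agents'],
    ['a standalone sentence'],
]
rows = sentence_shares(toy_articles)
print(rows[0])
```

Counting once per article keeps the percentage at or below 100 even when an article repeats a sentence, as the third toy article does.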

# Findings

Many of the articles show signs of copy/pasting:

* <b>61.64%</b> of the articles contained the sentence <i>"The show centers on FBI Special Agents Fox Mulder (David Duchovny) and Dana Scully (Gillian Anderson) who work on cases linked to the paranormal, called X-Files"</i>. Two of the other most common sentences mirror this one closely but add in the characters <i>"John Doggett (Robert Patrick)"</i> and <i>"Monica Reyes (Annabeth Gish)"</i>, who were both late additions to the show.


* Another <b>30.59%</b> contained <i>"Mulder is a believer in the paranormal, while the skeptical Scully has been assigned to debunk his work."</i>


* <b>Over a quarter</b> contained <i>"The episode is a "Monster-of-the-Week" story unconnected to the series' wider mythology."</i> Close variations of this sentence also occur frequently and take up 3 of the top 10 spots. Combining their frequencies, sentences mentioning the term <i>"Monster-of-the-Week"</i> occur in <b>35%</b> of the 219 articles.
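
Adding up the counts of several sentence variants could count an article twice if it contains more than one variant. A hedged sketch of one way around that is to count the articles that contain the phrase in any sentence. The function name `share_containing` and the toy sentences are illustrative and are not taken from the scraped data.

```python
def share_containing(articles, phrase):
    """Percentage of articles with at least one sentence containing `phrase`."""
    phrase = phrase.lower()
    hits = sum(
        1 for sentences in articles
        if any(phrase in s.lower() for s in sentences)
    )
    return round(hits / len(articles) * 100, 2)

# Toy data for illustration only.
toy_articles = [
    ['the episode is a "monster-of-the-week" story unconnected to the mythology'],
    ['the episode is a "monster-of-the-week" story, a stand-alone plot'],
    ['the episode helped explore the overarching mythology'],
    ['mulder is a believer in the paranormal'],
]
share = share_containing(toy_articles, 'Monster-of-the-Week')
print(share)
```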