# Politics and Social Sciences

The aim of this notebook is to analyze the text of any Wikipedia page. The code below connects to Wikipedia, downloads the text found on the page and prepares it for further analysis. By default it uses the Wikipedia page about Donald Trump, but you are free to use any other page: just change the link used in the code.

In [1]:
# Import packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string

# Specify url of the web page
source = urlopen('https://en.wikipedia.org/wiki/Donald_Trump').read()
soup = BeautifulSoup(source,'lxml')

# Extract the plain text content from paragraphs
text = ''
for paragraph in soup.find_all('p'):
    text += paragraph.text

# Clean text
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', ' ')
text = text.replace('\xa0', '')
text = text.replace('\'s', '')
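
To see what each cleaning step does, the sketch below applies the same substitutions to a short string. The `sample` text is made up for illustration and is not taken from the page.

```python
import re

# Hypothetical sample resembling raw paragraph text with citation markers
sample = "Trump's father[1][a] was born\nin New\xa0York."

cleaned = re.sub(r'\[.*?\]+', '', sample)  # drop bracketed markers such as [1] or [a]
cleaned = cleaned.replace('\n', ' ')       # newlines become spaces
cleaned = cleaned.replace('\xa0', '')      # non-breaking spaces are deleted, not replaced
cleaned = cleaned.replace('\'s', '')       # the possessive 's is dropped

print(cleaned)
# → Trump father was born in NewYork.
```

Note two side effects visible here: deleting `\xa0` glues the neighbouring words together, and dropping `'s` turns possessives into bare nouns (which is why the printed text further down reads "his father Fred Trump real estate business").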

Check the type of `text`

In [2]:
type(text)

str

Print first 1000 characters of `text`

In [3]:
print(text[:1000])

 Donald John Trump (born June 14, 1946) is an American politician who was the 45th president of the United States from 2017 to 2021. Before entering politics, he was a businessman and television personality. Born and raised in Queens, New York City, Trump attended Fordham University for two years and received a bachelor degree in economics from the Wharton School of the University of Pennsylvania. He became the president of his father Fred Trump real estate business in 1971, which he renamed The Trump Organization; he expanded the company operations to building and renovating skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, mostly by licensing his name. Trump and his businesses have been involved in more than 4,000 state and federal legal actions, including six bankruptcies. He owned the Miss Universe brand of beauty pageants from 1996 to 2015, and produced and hosted the reality television series The Apprentice from 2004 to 2015. Trump politic

We would now like to split the text into a list of sentences. For that we need a way of detecting the end of a sentence. We will assume that a period, exclamation mark or question mark followed by a space indicates the end of a sentence. Complete the function below, which checks whether the two-character string `char` is an end-of-sentence marker (`". "`, `"? "` or `"! "`).

In [4]:
def end_of_sentence(char):
    return char == ". " or char == "! " or char == "? "

# these tests should return True, True, False if your code is correct
print(end_of_sentence(". "))
print(end_of_sentence("? "))
print(end_of_sentence("D."))

True
True
False
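
The same check can also be written as a membership test, which some readers find easier to extend with further markers. This is only an alternative sketch; the name `is_sentence_end` is hypothetical and is not used elsewhere in the notebook.

```python
def is_sentence_end(char):
    # True when the two-character string is one of the three markers
    return char in (". ", "! ", "? ")

print(is_sentence_end(". "))
print(is_sentence_end("? "))
print(is_sentence_end("D."))
```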


The function `split_sentences` takes as its argument a text represented by a simple string. Within the function we define a variable `sentences_ls` in which we will store the individual sentences. We need both the start position and the end position of each sentence, so we define the variables `start` and `end` and set them to 0.
Next we iterate over the entire text. In each run of the `while` loop we look at two consecutive characters of the text. If they form an end-of-sentence marker, we extract the relevant fragment of the text and append it to the list of sentences.

In [5]:
def split_sentences(text):
    
    sentences_ls = []
    
    start = 0
    end = 0
    
    while end < len(text):
        char = text[end:end+2]
        if end_of_sentence(char):
            #extract the sentence
            sentence = text[start:end+1].strip()
            
            #append the sentence to the list of sentences
            sentences_ls.append(sentence)
            
            #update the starting value
            start = end + 1
            
        end = end + 1
        
    return sentences_ls

Test the functioning of `split_sentences` on the first 1000 characters of `text`

In [6]:
split_sentences(text[:1000])

['Donald John Trump (born June 14, 1946) is an American politician who was the 45th president of the United States from 2017 to 2021.',
 'Before entering politics, he was a businessman and television personality.',
 'Born and raised in Queens, New York City, Trump attended Fordham University for two years and received a bachelor degree in economics from the Wharton School of the University of Pennsylvania.',
 'He became the president of his father Fred Trump real estate business in 1971, which he renamed The Trump Organization; he expanded the company operations to building and renovating skyscrapers, hotels, casinos, and golf courses.',
 'Trump later started various side ventures, mostly by licensing his name.',
 'Trump and his businesses have been involved in more than 4,000 state and federal legal actions, including six bankruptcies.',
 'He owned the Miss Universe brand of beauty pageants from 1996 to 2015, and produced and hosted the reality television series The Apprentice from 20

If the function works correctly, define a new variable called `sentences_ls` and assign to it the result of calling the function `split_sentences` on the whole `text`

In [7]:
sentences_ls = split_sentences(text)

The function `tokenize` preprocesses a plain string by removing all punctuation and then splits it into a list of lowercase words.

In [8]:
def tokenize(sentence):
    sentence = re.sub(r'[^\w\s]', '', sentence).lower()
    words = sentence.split()
    return words
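
As a small illustration of what the regular expression keeps and removes, the sketch below runs the same `tokenize` definition on a made-up sentence (the sample string is an assumption, not text from the page).

```python
import re

def tokenize(sentence):
    # remove everything that is neither a word character nor whitespace, then lowercase
    sentence = re.sub(r'[^\w\s]', '', sentence).lower()
    return sentence.split()

print(tokenize("The U.S. economy, in 2017!"))
# → ['the', 'us', 'economy', 'in', '2017']
```

Because the dots are simply deleted, the abbreviation "U.S." becomes the token `us`, which is indistinguishable from the pronoun "us" in later word counts.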

Test the functioning of `tokenize` on the first element of `sentences_ls`

In [9]:
tokenize(sentences_ls[0])

['donald',
 'john',
 'trump',
 'born',
 'june',
 '14',
 '1946',
 'is',
 'an',
 'american',
 'politician',
 'who',
 'was',
 'the',
 '45th',
 'president',
 'of',
 'the',
 'united',
 'states',
 'from',
 '2017',
 'to',
 '2021']

## General Text Statistics

We would now like to compute some basic statistics about our text. In particular, we would like to get the following information:
 - Total number of words in the text
 - Total number of sentences in the text
 - Number of words in shortest sentence
 - Number of words in longest sentence
 - Average number of words in a sentence

To get these values we need to find out the number of words in every sentence. Define an empty list called `length_of_sentences`. Inside the `for` loop (we will learn about `for` loops later in the course) write one line of code which appends the length of `sentence` to the `length_of_sentences` list.

In [10]:
length_of_sentences = []
for sentence in sentences_ls:
    # append the length of sentence to the list
    # (len() of a string counts characters, not words)
    length_of_sentences.append(len(sentence))
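
Note that `len(sentence)` applied to a string returns the number of characters. If a count of words per sentence is wanted instead, one option is to tokenize each sentence first. The sketch below compares both on two made-up sentences; `sample_sentences`, `char_lengths` and `word_lengths` are hypothetical names used only here.

```python
import re

def tokenize(sentence):
    # same definition as earlier in the notebook
    sentence = re.sub(r'[^\w\s]', '', sentence).lower()
    return sentence.split()

sample_sentences = ["He was a businessman.", "U.S."]

char_lengths = [len(s) for s in sample_sentences]            # characters per sentence
word_lengths = [len(tokenize(s)) for s in sample_sentences]  # words per sentence

print(char_lengths)
# → [21, 4]
print(word_lengths)
# → [4, 1]
```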

Define a variable called `words_ls` and assign to it the result of calling the `tokenize` function on the whole text

In [11]:
words_ls = tokenize(text)

Now we have all the ingredients to generate the statistics about our text. Print them out:

In [12]:
print("Total number of words in the text:", len(words_ls))
print("Total number of sentences in the text:", len(sentences_ls))
print("Number of words in shortest sentence:", min(length_of_sentences))
print("Number of words in longest sentence:", max(length_of_sentences))
print("Average number of words in a sentence:", len(words_ls)/len(length_of_sentences))

Total number of words in the text: 17495
Total number of sentences in the text: 792
Number of words in shortest sentence: 4
Number of words in longest sentence: 822
Average number of words in a sentence: 22.089646464646464


Notice that the shortest sentence has a length of only 4. This seems a little suspicious, so we would like to investigate it. Keep in mind that `length_of_sentences` was filled with `len(sentence)` on strings, so these lengths are measured in characters rather than words. Write code which prints only the sentences whose length is smaller than 5.  
**HINT 1:** Use the `while` loop and a dummy variable `i` which represents the current index in the `length_of_sentences` list. Set `i` initially to 0 and increase it by one with each iteration of the `while` loop.  
**HINT 2:** What is the condition for your loop to end? `i` must stay smaller than the length of the `length_of_sentences` list.

In [13]:
i = 0
while i < len(sentences_ls):
    if length_of_sentences[i] < 5:
        print(sentences_ls[i])
    i = i + 1

U.S.
U.S.
aid.
Sen.
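
These short fragments appear because the splitting rule treats every period followed by a space as a sentence boundary, including the one that ends an abbreviation. The sketch below reproduces this on a made-up text; `split_on_markers` is a hypothetical compact helper written to follow the same rule as `split_sentences`.

```python
def split_on_markers(text):
    # Same rule as split_sentences: cut after '.', '!' or '?' when followed by a space
    sentences = []
    start = 0
    for end in range(len(text)):
        if text[end:end + 2] in (". ", "! ", "? "):
            sentences.append(text[start:end + 1].strip())
            start = end + 1
    return sentences

# Hypothetical sample; note the trailing space after the last period
sample = "The bill passed. Sen. Smith voted no. "
print(split_on_markers(sample))
# → ['The bill passed.', 'Sen.', 'Smith voted no.']
```

Handling abbreviations properly would require extra rules (for example a list of known abbreviations), which is beyond what this simple marker-based approach attempts.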


## Reading Time

On average people read around 220 words per minute. Using this information we can compute the estimated reading time of an article from the formula:
$$\textrm{reading time}= \dfrac{\textrm{number of words in text}}{\textrm{number of words per minute}}$$
Complete the function below to return the reading time in minutes (rounded to 2 decimal places), given a list of words from the text as input.

In [14]:
def reading_time(words_ls):
    total_words = len(words_ls)
    reading_time = round(total_words / 220, 2)
    return reading_time
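
As a quick check of the formula, the sketch below feeds the function a dummy list with the word count reported earlier (17495). The `words_per_minute` parameter is an assumption added here so the reading speed can be varied; the notebook's own version hard-codes 220.

```python
def reading_time(words_ls, words_per_minute=220):
    # number of words divided by reading speed, rounded to 2 decimal places
    return round(len(words_ls) / words_per_minute, 2)

dummy_words = ["word"] * 17495  # placeholder list with the same length as words_ls
print(reading_time(dummy_words))
# → 79.52
```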

Ask the user how much time they have to read the text. Based on the answer, print whether they can read the text in that time and how long reading the whole text takes. Also print either how many spare minutes they will have, or what percentage of the text they will manage to read within their time limit.  
When asking for user input, remember to check that the input is valid (use the `while` loop for that).

In [15]:
time = -1
while time < 0:
    time = int(input("How many minutes do you have? "))

time_to_read = reading_time(words_ls)

if time < time_to_read:
    print("You don't have enough time to read this text. It takes", time_to_read, "minutes to read the entire text.", 
          "\r\nYou will only manage to read", round((time/time_to_read)*100,2), "% of the text")
else:
    print("You can read the whole article. It will take you", time_to_read, "minutes and you will have", 
          round(time - time_to_read,2), "minutes left.")

You can read the whole article. It will take you 79.52 minutes and you will have 20.48 minutes left.
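
The loop above rejects negative numbers, but `int(...)` raises a `ValueError` if the user types something that is not a whole number. One possible way to handle that is sketched below; `parse_minutes` and `ask_minutes` are hypothetical helper names, and the `read` parameter exists only so the loop can be exercised without a keyboard.

```python
def parse_minutes(raw):
    """Return a non-negative int parsed from raw, or None if raw is not valid."""
    try:
        minutes = int(raw)
    except ValueError:
        return None
    return minutes if minutes >= 0 else None

def ask_minutes(prompt="How many minutes do you have? ", read=input):
    # keep asking until the answer parses to a non-negative integer
    minutes = None
    while minutes is None:
        minutes = parse_minutes(read(prompt))
    return minutes
```

In the notebook this would be used as `time = ask_minutes()`, after which the rest of the cell stays unchanged.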


## Word Counter

The last thing we would like to find out is which words are the most frequent in the entire text. To do this we will use the `Counter` class from the `collections` module of the standard library. The code below assigns to `word_rank` a list of tuples `(str, count of str)`, sorted from the most common to the least common word in `words_ls`.

In [16]:
from collections import Counter

word_rank = Counter(words_ls).most_common()
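
To make the shape of the result concrete, here is the same call on a tiny made-up word list (`sample_words` and `sample_rank` are hypothetical names used only for illustration).

```python
from collections import Counter

sample_words = ["the", "trump", "the", "in", "trump", "the"]

# most_common() returns (word, count) tuples, highest count first
sample_rank = Counter(sample_words).most_common()
print(sample_rank)
# → [('the', 3), ('trump', 2), ('in', 1)]

# slicing the list keeps only the top entries
print(sample_rank[:2])
# → [('the', 3), ('trump', 2)]
```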

Print the top 20 words in word_rank

In [17]:
print(word_rank[:20])

[('the', 1061), ('trump', 597), ('in', 566), ('and', 529), ('of', 503), ('to', 435), ('a', 324), ('his', 223), ('he', 171), ('that', 168), ('for', 160), ('was', 153), ('as', 141), ('on', 128), ('by', 119), ('with', 114), ('from', 102), ('had', 81), ('were', 80), ('us', 73)]


Ask the user for a word and print whether this word is present in the text. If it is, additionally print how many times it occurs in the text. Remember to convert the user input to lower case before checking whether it is present in `word_rank`.  
**HINT 1**: Use the same strategy as before for iterating over the `word_rank` list using the `while` loop.  
**HINT 2**: Create a boolean dummy variable `flag` which is set to `False` at the beginning. If the word you are searching for appears inside the loop, change the value of `flag` to `True`. After iterating over the entire `word_rank` list, check the value of `flag` and conclude from it whether the word is present in the list or not.

In [18]:
word = input("Give a word: ").lower()
i = 0
flag = False  # dummy variable to track the existence of the word

while i < len(word_rank):
    if word_rank[i][0] == word:
        print("This word appeared in the text: ", word_rank[i][1], "times")
        flag = True
    i = i + 1
    
if not flag:
    print("This word is not present in the text")

This word appeared in the text:  597 times
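
Outside the `while`-loop constraint of this exercise, the same lookup can be done directly on a `Counter`, which behaves like a dictionary and returns 0 for words it has never seen. The sketch below uses a made-up word list; `word_message` is a hypothetical helper name.

```python
from collections import Counter

sample_words = ["the", "trump", "the", "in", "trump", "the"]
counts = Counter(sample_words)

def word_message(word, counts):
    # lower-case the query so it matches the lower-cased tokens
    word = word.lower()
    if counts[word] > 0:
        return f"This word appeared in the text: {counts[word]} times"
    return "This word is not present in the text"

print(word_message("Trump", counts))
# → This word appeared in the text: 2 times
print(word_message("economy", counts))
# → This word is not present in the text
```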
