# 5. Live coding

This notebook shows how we analyse a text in practice. To do so, we will examine two judges' responses to asylum claims in the UK.







Legend of symbols:

- 🤓: Tips

- 🤖📝: Your turn

- ❓: Question

- 💫: Extra exercise 

## 5.1. Read text

As we have learned in this course, the first step is to import the text into this notebook.

Two approaches:

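- 1) Copy and paste the content into a **<tt>.txt</tt>** file.
- 2) Extract the text directly from the PDF with a library such as **<tt>pdftotext</tt>** (https://github.com/jalan/pdftotext) or **<tt>pdfplumber</tt>**, which is the one used below.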

In [None]:
# 1)
# Read the raw text from the .txt file
# (the with block closes the file automatically)
with open('../data/asylum_claims.txt', 'r') as f:
    text = f.read()

In [None]:
# Replace newline characters with spaces
import re
text = re.sub('\n', ' ', text)

In [None]:
print(text)

In [None]:
# 2)
# Extract the text of the first five pages of the first PDF
import pdfplumber

with pdfplumber.open("../data/PA059452018.pdf") as pdf:
    # extract_text() may return None for a page without a text layer,
    # so fall back to an empty string
    pages = [page.extract_text() or "" for page in pdf.pages[:5]]

pdf_1 = "\n".join(pages)

In [None]:
# Same extraction for the second PDF
with pdfplumber.open("../data/PA002402019.pdf") as pdf:
    pages = [page.extract_text() or "" for page in pdf.pages[:5]]

pdf_2 = "\n".join(pages)

# Combine both documents into one string
# (from here on, the name pdf refers to this string)
pdf = pdf_1 + "\n" + pdf_2

print(pdf)


## 5.2. Basic statistics

🤓 When analysing text, it is important to know the basic figures:
- How many words do we have? 
- How many sentences? 
- What are the most common words? 


❓ More questions?

### 5.2.1. How many words do we have? 

In [None]:
# How many words do we have?
words_txt = text.split()
len(words_txt)

In [None]:
words_pdf = pdf.split()
len(words_pdf)

In [None]:
# number of distinct words that appear in only one of the two sources
# (^ is the symmetric difference of the two sets)
len(set(words_pdf) ^ set(words_txt))
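
The `^` operator on two sets is the symmetric difference: the words found in exactly one of the two sets. To count only the words that are in one source but not in the other, the difference operator `-` would be used instead. A minimal sketch on two made-up word lists (the sample words are illustrative and are not taken from the tribunal decisions):

```python
# Toy word lists; the contents are made up for illustration.
words_a = ["the", "appellant", "claims", "asylum"]
words_b = ["the", "judge", "dismisses", "the", "appeal", "asylum"]

set_a, set_b = set(words_a), set(words_b)

only_in_a = set_a - set_b            # distinct words in A that never occur in B
only_in_b = set_b - set_a            # distinct words in B that never occur in A
in_one_only = set_a ^ set_b          # symmetric difference: union of the two above

print(sorted(only_in_a))
print(sorted(only_in_b))
print(len(in_one_only))
```

Because sets drop duplicates, these figures count distinct words, not occurrences.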

### 5.2.2. How many sentences do we have? 

In [None]:
# How many sentences do we have?
sent_txt = text.split('.')
len(sent_txt)

In [None]:
# How many sentences do we have?
sent_pdf = pdf.split('.')
len(sent_pdf)

In [None]:
# The above approach is limited, as not all sentences end
# with a full stop

# How many sentences do we have? Using a tokenizer that handles all types of punctuation marks
import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import sent_tokenize

sent_txt_nltk = sent_tokenize(text)
print(len(sent_txt_nltk))

In [None]:
# How many sentences do we have?
sent_pdf_nltk = sent_tokenize(pdf)
print(len(sent_pdf_nltk))
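
To see why splitting on full stops alone miscounts, here is a small sketch on a made-up three-sentence sample. It compares `str.split('.')` with a simple regular expression that also splits after `?` and `!`. The regular expression is only a heuristic chosen for this illustration (an abbreviation such as "Mr." would still be split), which is the kind of case a trained tokenizer like `sent_tokenize` is designed to handle.

```python
import re

# Made-up sample containing a question, an exclamation and a statement.
sample = "Was the claim credible? The judge said no! The appeal is dismissed."

# Naive approach: only full stops are sentence boundaries.
# The text after the final full stop becomes an empty string that is also counted.
naive = sample.split(".")

# Heuristic: split on whitespace that follows ., ! or ?
regex = re.split(r"(?<=[.!?])\s+", sample)

print(len(naive), naive)
print(len(regex), regex)
```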

### 5.2.3. What are the most common words? 

In [None]:
# What are the most common words? 
wordfreq_txt = []

# count words in text
for w in words_txt:
    wordfreq_txt.append(words_txt.count(w))
    
# create a list of (word, frequency) pairs
word_list = list(set(zip(words_txt, wordfreq_txt)))

    
print("Pairs\n" + str(word_list))

In [None]:
highest_value = [0]
word = [""]
for w in word_list:
    compare_value = w[1]
    if compare_value > highest_value[0]:
        highest_value[0] = compare_value
        word[0] = w[0]

In [None]:
print(word, highest_value)

In [None]:
# What are the most common words? 
wordfreq_pdf = []

# count words in text
for w in words_pdf:
    wordfreq_pdf.append(words_pdf.count(w))
    
# create a list of (word, frequency) pairs
word_list = list(set(zip(words_pdf, wordfreq_pdf)))

# function to sort the list by the second item of each tuple
def sort_pairs(tup):
    # key selects the count (second element of each pair);
    # reverse=True sorts in descending order, most frequent first
    return sorted(tup, key=lambda x: x[1], reverse=True)

word_list_sort = sort_pairs(word_list)
    
print("Pairs\n" + str(word_list_sort))
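
The counting loops above call `list.count` once per word, which rescans the whole list every time and becomes slow on long texts. As a sketch of an alternative, `collections.Counter` from the standard library builds the same word and frequency pairs in a single pass. The token list below is made up and simply stands in for `words_pdf`.

```python
from collections import Counter

# Made-up tokens standing in for words_pdf.
tokens = "the appeal is dismissed and the decision of the tribunal stands".split()

counts = Counter(tokens)             # one pass over the tokens
top_three = counts.most_common(3)    # (word, count) pairs, highest count first

print(top_three)
```

`most_common()` without an argument returns every pair sorted by count, so it can replace both the counting loop and the sorting function.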

In [None]:
import pandas as pd

df = pd.DataFrame(word_list_sort)
df.columns = ['words', 'counts']

In [None]:
df.head()

In [None]:
import matplotlib.pyplot as plt
import plotly.express as px

fig = px.bar(df.loc[0:10,:], x='counts', y='words', orientation='h', text = 'words',
             labels={
                     "counts": "Frequency",
                     "words": "Words"
                 },)
fig.layout.yaxis.type = 'category'
fig.update_layout(yaxis_categoryorder = 'total ascending')
fig.update_layout(yaxis=dict(showticklabels=False))
fig.update_traces(texttemplate='%{text}', textposition='auto', marker_color='green')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',   title={
        'text': "Word Frequency in Tribunal Appeals",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_layout(
    autosize=False,
    width=1050,
    height=500)
fig.show()

❓ Does this tell us anything about the content?

## 5.3. Clean text

We now clean the text with some of the techniques we have learned.

### 5.3.1. Lowercase

In [None]:
# Convert all words to lowercase
text_clean = ' '.join(w.lower() for w in text.split())

In [None]:
text_clean

### 5.3.2. Stop words 

In [None]:
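# Remove stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')
# Use a separate name so the imported stopwords module is not overwritten
# (otherwise re-running this cell fails)
stop_words = set(stopwords.words('english'))

In [None]:
text_clean = ' '.join(w for w in text_clean.split() if w not in stop_words)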

In [None]:
text_clean
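
The filter above keeps a word only if it is not in the stop word list. The sketch below repeats the idea on a made-up sentence with a tiny, hypothetical stop word set (NLTK's English list is much longer). It also shows a limitation of splitting on whitespace: punctuation stays attached to the word, so the final "was." does not match "was" and is kept.

```python
# Hypothetical, very small stop word set, for illustration only.
toy_stop_words = {"the", "that", "was", "it", "what"}

# Made-up sentence, already lowercased as in the previous step.
sample = "the judge decided that the claim was what it was."

kept = [w for w in sample.split() if w not in toy_stop_words]
print(" ".join(kept))
```

Lowercasing first matters here, because the comparison is case-sensitive and the stop words are stored in lowercase.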

### 5.3.3. Lemmatization

In [None]:
# Lemmatize: replace each token with its lemma (base form)
import spacy
nlp = spacy.load('en_core_web_sm') 

text_clean = [[token.lemma_ for token in sentence] for sentence in nlp(text_clean).sents]

In [None]:
text_clean_flat = [word for sent in text_clean for word in sent]
text_clean_flat

### 5.3.4. Count words

In [None]:
from collections import Counter

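# most_common() orders the pairs by count, so the first rows of the
# DataFrame built from this dict are the most frequent words
text_clean_counter = dict(Counter(text_clean_flat).most_common())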

In [None]:
df_clean = pd.DataFrame.from_dict(text_clean_counter, orient='index')

In [None]:
df_clean.reset_index(level=0, inplace=True)
df_clean.columns = ['words', 'counts']

In [None]:
fig = px.bar(df_clean.loc[0:10,:], x='counts', y='words', orientation='h', text = 'words',
             labels={
                     "counts": "Frequency",
                     "words": "Words"
                 },)
fig.layout.yaxis.type = 'category'
fig.update_layout(yaxis_categoryorder = 'total ascending')
fig.update_layout(yaxis=dict(showticklabels=False))
fig.update_traces(texttemplate='%{text}', textposition='auto', marker_color='purple')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',   title={
        'text': "Word Frequency in Tribunal Appeals",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_layout(
    autosize=False,
    width=1050,
    height=500)
fig.show()

## 5.4. Word cloud

Now, let's show the word cloud of this text:

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

word_cloud = WordCloud(background_color="white", repeat=True)
word_cloud.generate(text)

plt.axis("off")
plt.imshow(word_cloud, interpolation="bilinear")
plt.show()

## 5.5. What have we learned?

### 🤖📝 Now it's your turn:

🤖📝 Find the word 'EURODAC' using the function **<tt>search</tt>** (or **<tt>finditer</tt>**, to list every occurrence) from the **<tt>re</tt>** package.

In [None]:
for match in re.finditer("EURODAC", pdf):
    print(match)
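
For comparison, `re.search` returns only the first match (or `None` when there is none), while `re.finditer` yields every match. A minimal sketch on a made-up sentence, where only the search term comes from the exercise:

```python
import re

# Made-up sentence for illustration.
sample = "The EURODAC record was checked. A second EURODAC hit was found."

first = re.search("EURODAC", sample)                              # first match only, or None
all_spans = [m.span() for m in re.finditer("EURODAC", sample)]    # every match

if first is not None:
    print(first.group(), first.span())
print(all_spans)
```

Passing `flags=re.IGNORECASE` to either function would also match lowercase spellings such as "eurodac".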

In [None]:
df_clean[df_clean['words'].str.contains('eurodac', flags=re.IGNORECASE)]

🤖📝 Create a word cloud with a different colour pattern using the text from the PDF.

In [None]:
word_cloud = WordCloud(background_color="skyblue", colormap="Blues", repeat=True)
word_cloud.generate(pdf)  # the exercise asks for the text extracted from the PDF

plt.axis("off")
plt.imshow(word_cloud, interpolation="bilinear")

plt.show()

🤖📝 Remove symbols from **<tt>df_clean</tt>** and plot the word frequencies again.

In [None]:
import string

In [None]:
# All ASCII punctuation characters, one per list entry
punc_list = list(string.punctuation)

# The typographic possessive is not part of string.punctuation
punc_list.append("’s")

In [None]:
clean_df = df_clean[~df_clean['words'].isin(punc_list)].sort_values(by=['counts'], ascending=False).reset_index(drop=True).copy()

In [None]:
clean_df
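
The same filter can be sketched without pandas, which makes the logic easier to follow: drop every pair whose word is a symbol, then sort by count. The pairs below are made up and stand in for the rows of `df_clean`.

```python
import string

# Made-up (word, count) pairs standing in for the rows of df_clean.
pairs = [("appeal", 12), (".", 40), ("tribunal", 9), (",", 35), ("’s", 7), ("judge", 5)]

# string.punctuation only covers ASCII symbols, so the typographic "’s" is added by hand.
symbols = set(string.punctuation) | {"’s"}

# Keep pairs whose word is not a symbol, most frequent first.
kept = sorted((p for p in pairs if p[0] not in symbols), key=lambda p: p[1], reverse=True)
print(kept)
```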

In [None]:
fig = px.bar(clean_df.loc[0:10,:], x='counts', y='words', orientation='h', text = 'words',
             labels={
                     "counts": "Frequency",
                     "words": "Words"
                 },)
fig.layout.yaxis.type = 'category'
fig.update_layout(yaxis_categoryorder = 'total ascending')
fig.update_layout(yaxis=dict(showticklabels=False))
fig.update_traces(texttemplate='%{text}', textposition='auto', marker_color='purple')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',   title={
        'text': "Word Frequency in Tribunal Appeals",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.update_layout(
    autosize=False,
    width=1050,
    height=500)
fig.show()