# Stop Words NLP Exercise
* Notebook by Adam Lang
* Date: 3/15/2024
* In this notebook we will do the following:
    1. Find the percentage of stopwords in a text file.
    2. Remove stopwords from the text.
    3. Analyze the findings.


## Importing the data

In [3]:
# read the file into a string; the with-block closes the file automatically
path = '/content/drive/MyDrive/Colab Notebooks/Classical NLP/switzerland.txt'
with open(path, mode='r', encoding='utf-8') as file:
    text = file.read()

In [4]:
# view the text file
print(text)

Switzerland, officially the Swiss Confederation, is a country situated in the confluence of Western, Central, and Southern Europe. It is a federal republic composed of 26 cantons, with federal authorities based in Bern. Switzerland is a landlocked country bordered by Italy to the south, France to the west, Germany to the north, and Austria and Liechtenstein to the east. It is geographically divided among the Swiss Plateau, the Alps, and the Jura, spanning a total area of 41,285 km2 (15,940 sq mi), and land area of 39,997 km2 (15,443 sq mi). While the Alps occupy the greater part of the territory, the Swiss population of approximately 8.5 million is concentrated mostly on the plateau, where the largest cities and economic centres are located, among them Zürich, Geneva and Basel, where multiple international organisations are domiciled (such as FIFA, the UN's second-largest Office, and the Bank for International Settlements) and where the main international airports of Switzerland are.



## Cleaning Text

In [23]:
import re

# define a function to clean text
def clean_text(text):
    # lowercase the text to normalize it
    text = text.lower()

    # remove commas, periods, parentheses and newlines
    text = re.sub(r'[,.)(\n]', '', text)

    # replace hyphens with a space
    text = re.sub('-', ' ', text)

    return text


In [24]:
# clean text with function
cleaned_text=clean_text(text)

In [25]:
# view the cleaned text
print(cleaned_text)

switzerland officially the swiss confederation is a country situated in the confluence of western central and southern europe it is a federal republic composed of 26 cantons with federal authorities based in bern switzerland is a landlocked country bordered by italy to the south france to the west germany to the north and austria and liechtenstein to the east it is geographically divided among the swiss plateau the alps and the jura spanning a total area of 41285 km2 15940 sq mi and land area of 39997 km2 15443 sq mi while the alps occupy the greater part of the territory the swiss population of approximately 85 million is concentrated mostly on the plateau where the largest cities and economic centres are located among them zürich geneva and basel where multiple international organisations are domiciled such as fifa the un's second largest office and the bank for international settlements and where the main international airports of switzerland are


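Summary: the text is now lowercase, with commas, periods, parentheses and hyphens removed. Note that stripping every period also turns "8.5 million" into "85 million".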
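
A minimal sketch of a `clean_text` variant that keeps a period when it sits between two digits, so that "8.5" in the source text is not collapsed to "85" as in the output above. The name `clean_text_keep_decimals` is hypothetical and not part of the original notebook. Unlike the original function, this version also replaces newlines with spaces, which is an assumption about the desired behavior.

```python
import re

def clean_text_keep_decimals(text):
    """Lowercase text and strip punctuation, keeping periods inside numbers."""
    text = text.lower()
    # drop commas and parentheses
    text = re.sub(r"[,()]", "", text)
    # drop a period unless it sits between two digits (e.g. 8.5)
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", "", text)
    # turn hyphens and newlines into spaces, then collapse repeated whitespace
    text = re.sub(r"[-\n]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "Population of approximately 8.5 million, area of 41,285 km2 (15,940 sq mi)."
print(clean_text_keep_decimals(sample))
# population of approximately 8.5 million area of 41285 km2 15940 sq mi
```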

## Task 1: Find most frequent words
* We will use spaCy to tokenize the text and count how often each token appears.

In [26]:
import spacy

# load spacy model
nlp = spacy.load('en_core_web_sm')


# create spacy doc object
doc = nlp(cleaned_text)

In [27]:
# count word frequencies
word_dict = {}

# add word-count pair to dictionary
for token in doc:
  # check if the token is already in dictionary
  if token.text in word_dict:
    # increment the count of word by 1
    word_dict[token.text]=word_dict[token.text]+1
  else:
    # Add word to dict with count 1
    word_dict[token.text]=1
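
The same tally can be written with `collections.Counter` from the standard library. The sketch below uses a short hand-typed token list (an illustrative assumption) in place of the spaCy `doc`, so it runs without a model; with the notebook's objects the input would be `token.text for token in doc`.

```python
from collections import Counter

# stand-in for [token.text for token in doc]; illustrative sample only
tokens = ["the", "swiss", "plateau", "the", "alps", "and", "the", "jura"]

# Counter builds the word -> count mapping in one call
word_counts = Counter(tokens)

# most_common(n) returns the n highest counts as (word, count) pairs
print(word_counts.most_common(1))
# [('the', 3)]
```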

In [28]:
# print keys
word_dict.keys()

dict_keys(['switzerland', 'officially', 'the', 'swiss', 'confederation', 'is', 'a', 'country', 'situated', 'in', 'confluence', 'of', 'western', 'central', 'and', 'southern', 'europe', 'it', 'federal', 'republic', 'composed', '26', 'cantons', 'with', 'authorities', 'based', 'bern', 'landlocked', 'bordered', 'by', 'italy', 'to', 'south', 'france', 'west', 'germany', 'north', 'austria', 'liechtenstein', 'east', 'geographically', 'divided', 'among', 'plateau', 'alps', 'jura', 'spanning', 'total', 'area', '41285', 'km2', '15940', 'sq', 'mi', 'land', '39997', '15443', 'while', 'occupy', 'greater', 'part', 'territory', 'population', 'approximately', '85', 'million', 'concentrated', 'mostly', 'on', 'where', 'largest', 'cities', 'economic', 'centres', 'are', 'located', 'them', 'zürich', 'geneva', 'basel', 'multiple', 'international', 'organisations', 'domiciled', 'such', 'as', 'fifa', 'un', "'s", 'second', 'office', 'bank', 'for', 'settlements', 'main', 'airports'])

In [29]:
# let's convert the counts to a DataFrame
import pandas as pd

#create df
df = pd.DataFrame({'word':list(word_dict.keys()), 'count':list(word_dict.values())})

# sort df in desc order
df.sort_values(by='count', ascending=False, inplace=True, ignore_index=True)


# print shape and head
print(f"Shape of df is: {df.shape}")
df.head(5)

Shape of df is: (96, 2)


| | word | count |
|---|---|---|
| 0 | the | 18 |
| 1 | and | 9 |
| 2 | of | 7 |
| 3 | is | 5 |
| 4 | to | 4 |


Summary:
* There are 96 unique tokens in the DataFrame.
* The five most common words are all stop words.
* For our NLP task we should therefore remove them to surface the more meaningful words in the text file.
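
The notebook's introduction lists the percentage of stop words as a goal, but the cells above only rank words by count. A minimal sketch of that calculation follows. The helper name `stopword_percentage`, the tiny stop list and the sample sentence are all illustrative assumptions, not spaCy's stop list or the notebook's data.

```python
def stopword_percentage(tokens, stop_words):
    """Return the share of tokens found in stop_words, as a percentage."""
    if not tokens:
        return 0.0
    n_stop = sum(1 for tok in tokens if tok in stop_words)
    return 100 * n_stop / len(tokens)

# tiny illustrative stop list (an assumption, not spaCy's list)
stop_words = {"the", "is", "a", "of", "and", "in"}
tokens = "switzerland is a country in the confluence of western europe".split()

print(f"{stopword_percentage(tokens, stop_words):.1f}% of tokens are stop words")
# 50.0% of tokens are stop words
```

With the notebook's spaCy `doc`, the same ratio could be taken as the number of tokens whose `is_stop` attribute is true, divided by `len(doc)`.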

## Task 2: Remove stopwords

In [30]:
# keep only the tokens that are not stop words
new_tokens = [token.text for token in doc if not token.is_stop]

In [31]:
print(new_tokens)

['switzerland', 'officially', 'swiss', 'confederation', 'country', 'situated', 'confluence', 'western', 'central', 'southern', 'europe', 'federal', 'republic', 'composed', '26', 'cantons', 'federal', 'authorities', 'based', 'bern', 'switzerland', 'landlocked', 'country', 'bordered', 'italy', 'south', 'france', 'west', 'germany', 'north', 'austria', 'liechtenstein', 'east', 'geographically', 'divided', 'swiss', 'plateau', 'alps', 'jura', 'spanning', 'total', 'area', '41285', 'km2', '15940', 'sq', 'mi', 'land', 'area', '39997', 'km2', '15443', 'sq', 'mi', 'alps', 'occupy', 'greater', 'territory', 'swiss', 'population', 'approximately', '85', 'million', 'concentrated', 'plateau', 'largest', 'cities', 'economic', 'centres', 'located', 'zürich', 'geneva', 'basel', 'multiple', 'international', 'organisations', 'domiciled', 'fifa', 'un', 'second', 'largest', 'office', 'bank', 'international', 'settlements', 'main', 'international', 'airports', 'switzerland']


In [32]:
# extract these words into a dictionary
new_words_dict = {}

# Add word-count pair to dict
for token in new_tokens:
  # check if word is already in dict
  if token in new_words_dict:
    # increment count of word by 1
    new_words_dict[token] = new_words_dict[token]+1
  else:
    # add word to dict with count of 1
    new_words_dict[token] = 1
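
Filtering and counting can also be done in a single pass. The sketch below assumes a small hand-written stop list and token list for illustration; in the notebook the condition would be `not token.is_stop` on the spaCy tokens.

```python
from collections import Counter

# illustrative stop list and tokens (assumptions, not the spaCy objects above)
stop_words = {"the", "and", "of"}
tokens = ["the", "swiss", "plateau", "and", "the", "swiss", "alps"]

# filter out stop words and count the rest in one generator expression
content_counts = Counter(tok for tok in tokens if tok not in stop_words)

print(content_counts["swiss"])
# 2
```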

In [33]:
# print new dict keys
new_words_dict.keys()

dict_keys(['switzerland', 'officially', 'swiss', 'confederation', 'country', 'situated', 'confluence', 'western', 'central', 'southern', 'europe', 'federal', 'republic', 'composed', '26', 'cantons', 'authorities', 'based', 'bern', 'landlocked', 'bordered', 'italy', 'south', 'france', 'west', 'germany', 'north', 'austria', 'liechtenstein', 'east', 'geographically', 'divided', 'plateau', 'alps', 'jura', 'spanning', 'total', 'area', '41285', 'km2', '15940', 'sq', 'mi', 'land', '39997', '15443', 'occupy', 'greater', 'territory', 'population', 'approximately', '85', 'million', 'concentrated', 'largest', 'cities', 'economic', 'centres', 'located', 'zürich', 'geneva', 'basel', 'multiple', 'international', 'organisations', 'domiciled', 'fifa', 'un', 'second', 'office', 'bank', 'settlements', 'main', 'airports'])

In [35]:
# create a df now
new_df = pd.DataFrame({'word':list(new_words_dict.keys()), 'count':list(new_words_dict.values())})

# sort df in desc order
new_df.sort_values(by='count', ascending=False, inplace=True, ignore_index=True)

# print out results
print(f"Shape of new_df is: {new_df.shape}")

# print head of new df
new_df.head(10)

Shape of new_df is: (74, 2)


| | word | count |
|---|---|---|
| 0 | switzerland | 3 |
| 1 | swiss | 3 |
| 2 | international | 3 |
| 3 | federal | 2 |
| 4 | largest | 2 |
| 5 | mi | 2 |
| 6 | sq | 2 |
| 7 | km2 | 2 |
| 8 | alps | 2 |
| 9 | plateau | 2 |


Summary:
* Removing stop words reduced the number of unique words from 96 to 74.
* The top words now relate to Switzerland itself (switzerland, swiss), to the terms international and federal, and to geography (alps, plateau, km2, sq, mi).