![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=Languages/MostUsedEnglishWords/most-used-english-words.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Most Used English Words

If you are learning to speak or read English, it is good to know which words are the most used.

We will take a word list made by a reading teacher named [Edward Fry](https://books.google.ca/books/about/The_Reading_Teacher_s_Book_Of_Lists.html?id=XYidAAAAMAAJ&redir_esc=y) and check where its words are on a word list from all [Google Books between the years 1800 and 2019](https://books.google.com/ngrams) that [someone made for us](https://github.com/hackerb9/gwordlist).

We will explain it using only the [1000 most common words in English](https://www.splasho.com/upgoer5) plus the word [list](https://www.thefreedictionary.com/list).

## Starting

Press the `▶▶` at the top to make this all work.

In [None]:
import pandas as pd
fry = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/Languages/MostUsedEnglishWords/fry-sight-words.csv')
ngrams = pd.read_table('https://raw.githubusercontent.com/callysto/data-files/main/Languages/MostUsedEnglishWords/frequency-alpha-gcide.txt', sep=r'\s+')
words = pd.merge(fry, ngrams, on='WORD', how='left')
words.drop(['COUNT','PERCENT','CUMULATIVE'], axis=1, inplace=True)
words

We see there are some words, like "workers", that show `NaN`. This means they are not in the Google list.

In [None]:
words[words['#RANKING'].isnull()]
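Here is a small example of how putting two lists together leaves `NaN` for words with no match. The two small lists and their numbers are made up for this example and are not the real data.

```python
import pandas as pd

# Made-up small lists, only to show the idea (not the real data).
small_fry = pd.DataFrame({'WORD': ['the', 'worker', 'workers'],
                          'LIST': [100, 900, 900]})
small_google = pd.DataFrame({'WORD': ['the', 'worker'],
                             '#RANKING': [1, 1500]})

# how='left' keeps every row from the first list, even with no match.
# indicator=True adds a '_merge' column that says where each row came from.
small_words = pd.merge(small_fry, small_google, on='WORD', how='left',
                       indicator=True)

# Rows marked 'left_only' are in the first list but not in the second.
not_found = small_words[small_words['_merge'] == 'left_only']['WORD'].tolist()
print(small_words)
print('Not found:', not_found)
```

The word with no match gets `NaN` under `#RANKING`, the same as in the big list above.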

## Important Parts of Words

There are 101 words from the Fry list that are not in the Google list. But we can check if the important part of a word is in the Google list.

In [None]:
important_part = 'worker'

row = ngrams[ngrams['WORD']==important_part].index[0]
print('The word "'+important_part+'" is number',int(ngrams.at[row, '#RANKING']),'in the Google list.')
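The check above stops with an error if the word is not in the list at all, because `.index[0]` has nothing to pick. One way around this, shown here with a made-up small list and a made-up helper named `find_ranking`, is to look first and give back `None` when nothing is found.

```python
import pandas as pd

# Made-up small list, only to show the idea (not the real data).
small_google = pd.DataFrame({'WORD': ['the', 'worker'],
                             '#RANKING': [1, 1500]})

def find_ranking(word_list, word):
    """Return the ranking of a word, or None if the word is not in the list."""
    found = word_list[word_list['WORD'] == word]
    if found.empty:
        return None
    return int(found['#RANKING'].iloc[0])

print(find_ranking(small_google, 'worker'))   # a word that is in the list
print(find_ranking(small_google, 'workers'))  # a word that is not in the list
```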

So we need to find the important parts of those words and see if they are in the Google list.

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer
important_part_finder = PorterStemmer()

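important_parts = []
# Missing rankings are filled with 'fix' so they are easy to spot.
# itertuples() cannot use '#RANKING' as a name, so it calls that column _3.
for row in words.fillna(value='fix').itertuples():
    if row._3 == 'fix':
        # Not in the Google list: use the important part of the word.
        important_parts.append(important_part_finder.stem(row.WORD))
    else:
        important_parts.append(row.WORD)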
fry['ORIGINAL_WORD'] = fry['WORD']
fry['WORD'] = important_parts
words = pd.merge(fry, ngrams, on='WORD', how='left')
words.drop(['COUNT','PERCENT','CUMULATIVE'], axis=1, inplace=True)
print('Now there are',len(words[words['#RANKING'].isnull()]),'words not found.')
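The important part finder cuts endings off words by following rules, so what it gives back is not always a real word. This may be one reason some words are still not found. Here is a small try with a few words picked only as examples.

```python
from nltk.stem.porter import PorterStemmer

important_part_finder = PorterStemmer()

# A few words picked only as examples.
example_words = ['workers', 'ponies', 'Washington']
example_parts = {}
for word in example_words:
    example_parts[word] = important_part_finder.stem(word)
    print(word, '->', example_parts[word])
```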

We can look at the words that are not found.

In [None]:
words[words['#RANKING'].isnull()]

## Taking Out Words

Of those 57 words, most are [contractions](https://www.thefreedictionary.com/contraction) (like `we'll`), [nationalities](https://www.thefreedictionary.com/nationality) (like `Greek`), or places (like `Washington`). We could go through and change them, but instead we will just take them out.

In [None]:
words.dropna(inplace=True)
words
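A more careful way to take out words, shown here on a made-up small list, is to tell `dropna` which column to check. Then a row is taken out only when `#RANKING` is missing, and not because some other column is empty.

```python
import pandas as pd

# Made-up small list, only to show the idea (not the real data).
small_words = pd.DataFrame({
    'WORD': ['the', "we'll", 'worker'],
    'LIST': [100, 300, 900],
    '#RANKING': [1, None, 1500],
})

# subset= means only a missing '#RANKING' takes a row out.
kept = small_words.dropna(subset=['#RANKING'])
print(len(small_words) - len(kept), 'word taken out')
print(kept)
```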

## Coloured Lines

Now we can see how well the `#RANKING` order from Google matches up with the `LIST` order from Fry.

In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=10, cols=1)
for fry_list in words['LIST'].unique():
    df = words[words['LIST']==fry_list]
    fig.add_trace(go.Scatter(x=df.index,y=df['#RANKING'],name=str(fry_list),hovertext=df['WORD']),row=int(fry_list/100),col=1)
fig.update_layout(height=1200,legend_title_text='Fry List',title_text='Google List and Fry List Match (shorter is better match)')
fig.update_yaxes(title='Google').update_xaxes(title='Fry', row=10)
fig.show()

We see that there are some words like `spell` and `numeral` that are used often in the Fry list but not so much in the Google list. This may be because the Fry list is from books for Grades 3 to 9, and the Google list is from as many books as they could scan.

We can see which words are not a good match between the lists because they come much later in the Google list.

In [None]:
google_max = 3000
words[words['#RANKING']>google_max]
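When a list is cut in two at a number like `google_max`, it helps to check that no word is lost. In this small made-up example, one word sits at exactly `google_max`. Using `>` for one part and `<=` for the other keeps every word in one part or the other.

```python
import pandas as pd

# Made-up small list, only to show the idea (not the real data).
small_words = pd.DataFrame({'WORD': ['the', 'spell', 'edge'],
                            '#RANKING': [1, 5000, 3000]})
google_max = 3000

later = small_words[small_words['#RANKING'] > google_max]   # too far down the list
kept = small_words[small_words['#RANKING'] <= google_max]   # every other word

# The two parts together should have as many words as the whole list.
print(len(later) + len(kept) == len(small_words))
```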

We can make the coloured lines again without those words.

In [None]:
google_max = 3000
fig = make_subplots(rows=10, cols=1)
for fry_list in words['LIST'].unique():
    df = words[(words['LIST']==fry_list) & (words['#RANKING']<=google_max)]
    fig.add_trace(go.Scatter(x=df.index,y=df['#RANKING'],name=str(fry_list),hovertext=df['WORD']),row=int(fry_list/100),col=1)
fig.update_layout(height=1200,legend_title_text='Fry List',title_text='Google List (up to '+str(google_max)+') and Fry List Match')
fig.update_yaxes(title='Google').update_xaxes(title='Fry', row=10)
fig.show()

## Ending

We checked two "most used" word lists to see how they were the same. There were some words that were much higher on one list, but most were a pretty good match.

You can also check out other word lists like [Ogden's Basic English Vocabulary](http://ogden.basic-english.org/words.html), the [Spell Checker Oriented Word Lists](http://wordlist.aspell.net), the [General Service List](http://jbauman.com/aboutgsl.html), or the [Wiktionary Frequency Lists](https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists).

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)