# Using Natural Language Processing to derive insights from BSc Data Science personal statements

This notebook analyses 10 BSc Data Science personal statements, representing 50% of the cohort that started in 2022.

Using NLP, the following will be identified from the personal statements:
- Main topics (using topic modelling)
- Top 10 most frequent words - these may be more insightful than topic modelling, because topic modelling is an unsupervised machine learning technique and is not always as effective as desired
- Key people who have been mentioned - this could give prospective students a basis for further research into their interests
- Readability levels - these will be calculated using the Flesch-Kincaid Grade Level formula, which estimates the grade level needed to understand a text from the average number of syllables per word and the average number of words per sentence.
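For reference, the Flesch-Kincaid Grade Level is usually written as below, with the same coefficients that the `flesch_kincaid_grade` function in Step 2 uses:

$$
\text{Grade level} = 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59
$$

As an illustration with made-up numbers, a text averaging 20 words per sentence and 1.5 syllables per word would score 0.39 × 20 + 11.8 × 1.5 − 15.59 = 7.8 + 17.7 − 15.59 = 9.91.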

In [44]:
import pandas as pd

# Reading the text file
with open('ST115_personal_statements.txt', 'r') as file:
    all_ps = file.read()

# Displaying the first 1000 characters of the text file
all_ps[:1000]

'Personal statement:\nTaking part in a coding competition and several UKMT challenges exposed me to new ways of approaching problems from what I was used to in school. The challenge of having to think differently about problems and eventually arrive at solutions after reaching dead-ends was something I relished doing, which convinced me that a quantitative degree was for me.\nI have continued to challenge myself and explore the interplay between computer science and mathematics by solving Project Euler problems. For example, I used recursion to compute the convergents of continued fractions of the square root of 2. These problems nurtured my interest in pure mathematics, leading me to read Kevin Houston\'s "How to Think like a Mathematician". This introduced me to more sophisticated inductive proof techniques than the ones I learned at A-level. To supplement this, I explored how Lean, a programming language, can be taught to inductively prove a simple addition from first principles. Wh

Looking at the first 1000 characters shown above, it can be seen that some initial data cleaning is needed, namely removing '\n' and the other characters that are displayed with a backslash, such as \'.
We also want to store the text in two forms: all the personal statements combined as one text for topic modelling, and each statement individually for calculating readability and identifying key people.

In [46]:
# Initial cleaning: replace newlines with spaces, then remove non-breaking spaces and apostrophes
all_ps = all_ps.replace("\n", " ")
all_ps = all_ps.replace("\xa0", "")
all_ps = all_ps.replace("\'", "")
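One detail that may help when reading the cell above: the backslashes in the notebook output come from how Python displays strings, not from the text itself. The sketch below uses a made-up sample string (not taken from the file) to show that `"\'"` is simply an apostrophe, so the last `replace` call removes apostrophes.

```python
# Hypothetical sample text; not taken from the personal statements file
sample = 'Kevin Houston\'s book\nintroduced me to proofs'

# repr() is what the notebook displays: special characters appear escaped
print(repr(sample))

# "\'" and "'" are the same one-character string, so no backslash is stored
print("\'" == "'")
print("\\" in sample)

# Mirroring the cleaning cell: newlines become spaces, apostrophes are dropped
cleaned = sample.replace("\n", " ").replace("\'", "")
print(cleaned)
```

A side effect worth keeping in mind is that possessives and contractions lose their apostrophe (for example "Houston's" becomes "Houstons"), which can affect later tokenisation.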

<b> STEP 1: Creating a dataframe for the personal statements </b>
- Creating a dataframe where each row accounts for a single personal statement. 

In [48]:
# This phrase has been used in the text file to indicate the start of a different personal statement
individual_statements = all_ps.split('Personal statement:')

ps_df = pd.DataFrame({'Personal Statements': individual_statements})

#The first row is empty so it is removed
ps_df = ps_df.drop(index=0).reset_index(drop=True)
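The reason the first row is empty is that the text begins with the marker phrase, so `split` returns an empty string before it. A minimal sketch with a made-up two-statement text illustrates this, along with one possible alternative that filters out empty pieces instead of dropping index 0. The names `toy_text`, `parts` and `statements` are hypothetical.

```python
# Hypothetical miniature of the text file, after newlines were replaced by spaces
toy_text = (
    "Personal statement: First statement text. "
    "Personal statement: Second statement text."
)

parts = toy_text.split("Personal statement:")
# The text starts with the marker, so the first piece is an empty string
print(repr(parts[0]))

# Alternative to dropping index 0: strip whitespace and keep non-empty pieces
statements = [part.strip() for part in parts if part.strip()]
print(statements)
```

The resulting list could be passed to `pd.DataFrame` in the same way as `individual_statements`; stripping also removes the leading space that each statement otherwise keeps.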


<b> STEP 2: Calculating Readability </b>

Creating the Flesch-Kincaid function to calculate the readability scores 

In [49]:
import re

def count_syllables(word):
    '''
    Helper function to count the number of syllables in a given word
    Input: 
    word : string
    
    Output:
    num_vowels : int
    '''
    num_vowels = len(re.findall(r'[aeiouy]+', word))

    # Subtract the number of silent e's at the end of the word
    if re.search(r'e$', word):
        num_vowels -= 1

    # Subtract one for each diphthong
    num_vowels -= len(re.findall(r'[aeiouy]{2}', word))

    # Add one if the word ends in "-le"
    if re.search(r'le$', word):
        num_vowels += 1

    # Add one if the word is one syllable and ends in a consonant followed by "y"
    if len(re.findall(r'^[^aeiouy]+[aeiouy]+[^aeiouy]+y$', word)):
        num_vowels += 1

    return max(1, num_vowels)

def flesch_kincaid_grade(text):
    '''
    Uses the Flesch-Kincaid formula to calculate the grade level required to read a specified text
    
    Inputs: 
    text : string
    The text to calculate the readability level of 
    
    Outputs: 
    grade_level : float
    Flesch-Kincaid grade level of the text, rounded to one decimal place
    '''
    sentences = text.split('.')
    words = text.split()

    # Calculate the average number of words per sentence
    words_per_sentence = len(words) / len(sentences)

    # Calculate the average number of syllables per word
    syllables = 0
    for word in words:
        syllables += count_syllables(word)
    syllables_per_word = syllables / len(words)

    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    
    grade_level = round(grade_level, 1)

    return grade_level
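Note that `text.split('.')` returns a trailing empty string when the text ends with a full stop, and it ignores `!` and `?`, so the sentence count can be slightly off. The sketch below is one possible variant, not the version used to produce the scores in this notebook: it splits sentences with a regular expression and uses a deliberately simple, hypothetical syllable helper (`count_syllables_simple`). Its scores would differ from those reported here.

```python
import re

def count_syllables_simple(word):
    """Rough syllable estimate: number of vowel groups, at least 1 (hypothetical helper)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_sketch(text):
    """Flesch-Kincaid grade with regex-based sentence and word splitting."""
    # Split on runs of ., ! or ? and discard empty pieces
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, optionally with internal apostrophes
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables_simple(w) for w in words)
    grade = 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
    return round(grade, 1)

sample = "The cat sat. The dog ran."
print(sample.split("."))        # the last element is an empty string
print(fk_grade_sketch(sample))
```

Very simple text can produce a negative grade, as in this toy sample. Abbreviations and decimal points are still counted as sentence breaks in this sketch.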

Applying the function to all the personal statements

In [50]:
ps_df['Readability Age Grade'] = ps_df['Personal Statements'].apply(flesch_kincaid_grade)


In [51]:
ps_df

| | Personal Statements | Readability Age Grade |
|---|---|---|
| 0 | Taking part in a coding competition and sever... | 12.5 |
| 1 | Mathematics and the social sciences are seen ... | 14.5 |
| 2 | Data Science appealed to me as it offers a n... | 14.0 |
| 3 | My interest in data analytics stems from an e... | 13.3 |
| 4 | Growing up on the football pitch, I learnt t... | 10.5 |
| 5 | I have always been drawn to figures and data,... | 12.1 |
| 6 | Data is one of the most valuable tools that t... | 11.7 |
| 7 | I couldn’t believe what I saw. After looking ... | 14.0 |
| 8 | I believe that the intersection of mathemati... | 13.6 |
| 9 | Seeing how influential big data is in shapin... | 13.8 |


<b> STEP 3: NLP Preprocessing tasks</b> 

The following tasks will be carried out in this section: 
- Tokenisation
- Stop word removal
- Lemmatisation
- POS tagging
- TF-IDF
    

<b>Tokenisation:</b>

In [52]:
import nltk
import string
from nltk.tokenize import word_tokenize

ps_df['Tokens'] = ps_df['Personal Statements'].apply(lambda x: x.lower()).apply(word_tokenize)

ps_df['Tokens'] = ps_df['Tokens'].apply(lambda tokens: [token for token in tokens if token not in string.punctuation])


<b> Removing stop words </b>

In [53]:
from nltk.corpus import stopwords
# The NLTK stop word file is named in lowercase; 'English' fails on case-sensitive file systems
sw = stopwords.words('english')

ps_df['Filtered Tokens'] = ps_df['Tokens'].apply(lambda tokens: [token for token in tokens if token not in sw])

Checking whether any tokens remain that are not in the NLTK stop word list but should still be removed

In [55]:
full_token_dict = {}

for row in ps_df['Filtered Tokens']:
    for token in row:
        if token in full_token_dict: 
            full_token_dict[token] += 1
        else: 
            full_token_dict[token] = 1

sorted_dict = sorted(full_token_dict.items(), key=lambda x: x[1], reverse=True)

sorted_dict[:10]


[('data', 109),
 ('’', 42),
 ('learning', 41),
 ('science', 37),
 ('using', 24),
 ('machine', 23),
 ('used', 21),
 ('social', 17),
 ('mathematics', 16),
 ('models', 15)]

As can be seen, the only remaining tokens that should be removed are punctuation marks that were not caught by string.punctuation.
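As an aside, the frequency tally in the cell above can also be written with `collections.Counter` from the standard library. The sketch below uses made-up token lists in place of `ps_df['Filtered Tokens']`; the names `toy_rows` and `token_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical token lists standing in for ps_df['Filtered Tokens']
toy_rows = [
    ["data", "science", "data"],
    ["machine", "learning", "data"],
]

# Counter.update adds the counts of every token in each row
token_counts = Counter()
for row in toy_rows:
    token_counts.update(row)

# most_common(n) returns (token, count) pairs sorted by descending count
print(token_counts.most_common(2))
```

This is equivalent in effect to building the dictionary by hand and sorting it, just shorter.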

In [56]:
#Taking out left over punctuation

ps_df['Filtered Tokens'] = ps_df['Filtered Tokens'].apply(lambda tokens: [token for token in tokens if token not in ['’', '‘', '``', "''", '“', '”']])
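The leftover tokens survive the earlier filter because `string.punctuation` contains only single ASCII punctuation characters, so curly quotes and two-character tokens such as the double backtick are not matched. The sketch below demonstrates this on a made-up token list and shows one possible alternative, keeping only tokens that contain at least one letter or digit. This alternative is an assumption for illustration, not the approach used in this notebook.

```python
import string

# Hypothetical tokens, including the kinds of leftovers seen above
tokens = ["data", ",", "’", "``", "science", "''"]

# The check used earlier: only single ASCII punctuation marks are removed
ascii_filtered = [t for t in tokens if t not in string.punctuation]
print(ascii_filtered)

# Possible alternative: keep tokens containing at least one letter or digit
word_like = [t for t in tokens if any(ch.isalnum() for ch in t)]
print(word_like)
```

The alternative avoids maintaining a hand-written list of punctuation tokens, at the cost of also dropping any other purely symbolic tokens.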

<b> Lemmatisation and POS Tagging </b>

In [66]:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

ps_df['Tagged Tokens'] = ps_df['Filtered Tokens'].apply(lambda x: pos_tag(x))


In [82]:
def get_wordnet_pos(tag):
    first_letter = tag[0]
    if first_letter == 'J': return wordnet.ADJ
    if first_letter == 'V': return wordnet.VERB
    if first_letter == 'N': return wordnet.NOUN
    if first_letter == 'R': return wordnet.ADV
    return None 

lemmatiser = WordNetLemmatizer()

def lemmatise(tokens):
    lemmatised_tokens = []
    for token, tag in tokens:
        pos = get_wordnet_pos(tag)
        if pos is None:
            lemma = lemmatiser.lemmatize(token)
        else:
            lemma = lemmatiser.lemmatize(token, pos)
        lemmatised_tokens.append(lemma)
    return lemmatised_tokens

ps_df['Lemmatised Tokens'] = ps_df['Tagged Tokens'].apply(lemmatise)
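The mapping step is needed because `pos_tag` returns Penn Treebank tags (such as `VBG` or `NNS`), while `WordNetLemmatizer.lemmatize` expects a WordNet part of speech and treats a word as a noun when none is given. The sketch below restates the mapping as a dictionary lookup without importing NLTK. It assumes the WordNet constants are the one-letter strings shown in the comment, and the names `PENN_TO_WORDNET` and `penn_to_wordnet` are hypothetical.

```python
# Assumption: WordNet's POS constants are the one-letter strings below
# (wordnet.ADJ == 'a', wordnet.VERB == 'v', wordnet.NOUN == 'n', wordnet.ADV == 'r')
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBG' to a WordNet POS, or None."""
    return PENN_TO_WORDNET.get(tag[:1])

for tag in ["JJ", "VBG", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four groups, such as `IN` for prepositions, map to `None`, which corresponds to the fallback branch in `lemmatise`.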



<b> TF-IDF </b>

In [99]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Join each statement's lemmatised tokens back into one string per document
documents = [' '.join(tokens) for tokens in ps_df['Lemmatised Tokens']]

vectorizer = TfidfVectorizer()

tf_idf_matrix = vectorizer.fit_transform(documents)
tf_idf_matrix


<10x1447 sparse matrix of type '<class 'numpy.float64'>'
	with 2535 stored elements in Compressed Sparse Row format>
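To make the weighting more concrete, the sketch below computes TF-IDF by hand for two made-up documents. It assumes `TfidfVectorizer`'s default settings: raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The function name `tfidf_sketch` is hypothetical, and tokenisation here is a plain whitespace split, whereas `TfidfVectorizer` by default also drops single-character tokens.

```python
import math
from collections import Counter

def tfidf_sketch(documents):
    """Pure-Python TF-IDF with smoothed IDF and L2 row normalisation."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    vocab = sorted({tok for doc in tokenised for tok in doc})
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Hypothetical documents, not taken from the personal statements
vocab, rows = tfidf_sketch(["data science data", "machine learning data"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Terms that appear in every document, like "data" here, receive the lowest IDF, which is why very common words contribute less per occurrence than rarer ones.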

<b> STEP 4: Topic Modelling </b>

In [176]:
from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=5, random_state=42)

lda_model.fit(tf_idf_matrix)


In [177]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()
n_top_words = 10
for topic_idx, topic in enumerate(lda_model.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic #0:
data poverty social work mpi night relevant learn team science
Topic #1:
data use company business technology firm show math work end
Topic #2:
machine data model big learn theorem learning use increase practice
Topic #3:
data use learn machine network function problem science one understand
Topic #4:
science data communication lead level see social element value allocation


# Findings

<b> Findings from Topic modelling </b>

From this it can be seen that some of the main topics discussed in successful BSc Data Science personal statements include: 
- Applications of Data Science to the social sciences
- Work experience and data science in the context of companies
- Machine learning models and learning about the theory behind them

<b> Top 10 Words</b>

In [181]:
#Top 10 words

lemma_dict = {}

for row in ps_df['Lemmatised Tokens']:
    for token in row:
        if token in lemma_dict: 
            lemma_dict[token] += 1
        else: 
            lemma_dict[token] = 1

top_words_dict = sorted(lemma_dict.items(), key=lambda x: x[1], reverse=True)

top_words_dict[:11]

[('data', 110),
 ('use', 58),
 ('learn', 45),
 ('science', 41),
 ('machine', 26),
 ('problem', 21),
 ('find', 21),
 ('model', 20),
 ('social', 17),
 ('allow', 17),
 ('think', 16)]

Using the most frequent lemmatised words, it can be seen that the key topics discussed are data science, machine learning and applications to social science, whilst verbs such as 'use', 'learn', 'find', 'allow' and 'think' can indicate how applicants have engaged with these topics.

Although it may seem like a large abstraction, these key words show that successful BSc Data Science applicants have talked a lot about data science and machine learning in their personal statements, as well as the social issues they relate to. Despite this seeming like an obvious insight, it is useful to keep in mind for prospective applicants who may not only be applying to Data 