| Name (Last, First) | Student ID | Section contributed | Section edited | Other contribution |
|:------------------:|:----------:|:-------------------:|:--------------:|:------------------:|
| Natanael, Edward | 30140286 | GitHub setup/sync, function write-up | Code and write-up | Troubleshoot FreqDist, GitHub merge conflict |
| Supangat, Jonathan | 301416826 | Data collection, function write-up | Code and write-up | Communication and task coordination |

External source file URLs:
* Ghostbusters : https://www.scifiscripts.com/scripts/Ghostbusters.txt
* Middlemarch : cleaned metadata from lab 4
* Newspaper (EV Cars): https://www.cnn.com/interactive/2019/08/business/electric-cars-audi-volkswagen-tesla/, https://www.cnn.com/2019/08/01/cars/future-of-electric-car-charging/index.html, https://www.cnn.com/2019/08/07/business/ford-ceo-hackett-elon-musk-table-interview/index.html

**Importing Libraries**

In [2]:
import nltk
from nltk.corpus import names
# import the NLTK packages we know we need
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

import random
import numpy as np
import os
import pandas as pd

In [4]:
# import spaCy and the small English language model
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2000000

# this does prettier displays on spaCy
from spacy import displacy

In [7]:
#nltk.download('all')

**Defining Functions**

In [6]:
def get_text_info(text):
    """
    Uses NLTK to calculate: tokens, types, lexical diversity, Top 10 Frequent words
    
    Args:
        text (str): a string containing the file or text
        
    Returns: 
        dict: a dictionary containing tokens, types, and lexical diversity, Top 10 Frequent words
    """
    tokens = nltk.word_tokenize(text.lower()) # Convert to lowercase for easy analysis
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    # Lexical diversity = types / tokens (0.0 for empty text, to avoid dividing by zero)
    lex_div = n_types / n_tokens if n_tokens else 0.0
    
    # Create frequency distribution, without including punctuation
    words = [word for word in tokens if word.isalnum()]
    freqd = FreqDist(words) # punctuation removed
    # Get the Top 10 used words 
    top10 = freqd.most_common(10)
    return {
            'number of tokens': n_tokens,
            'types': n_types,
            'lexical diversity': lex_div,
            'Top 10 Freq Words': top10, 
        }


def process_dir(path):
    """
    Reads all the files in a directory. Processes them using the 'get_text_info' function
    
    Args: 
        path (str): path to the directory where the files are
        
    Returns:
        dict: a dictionary with file names as keys and the tokens, types, lexical diversity, and top 10 frequent words as values
    
    """
    file_info = {}

    for filename in os.listdir(path):
        if filename.endswith(".txt"):    
            file_path = os.path.join(path, filename)      
            with open(file_path, 'r', encoding="utf-8") as f:
                text = f.read()
                file_info[filename] = get_text_info(text)
    return file_info

def listentity(path):
    '''
    Reads all the files in a directory. Processes them using spaCy to extract the named entities and their labels, if present.

    Args: 
        path (str): path to the directory where the files are
    
    Returns:
        pandas.DataFrame: one row per named entity, with columns 'Filename', 'Entity', and 'Label'
    '''
    # List of (filename, entity text, entity label) tuples for all files in the directory
    named_entities = []
    for filename in os.listdir(path):
        if filename.endswith(".txt"):    
            file_path = os.path.join(path, filename)      
            with open(file_path, 'r', encoding="utf-8") as f:
                text = f.read()
            # Run the spaCy pipeline on the text
            doc = nlp(text)

            # go through the entities and append each to the list
            for ent in doc.ents:
                named_entities.append((filename, ent.text, ent.label_))

    # Build the data frame once, after all files are processed. Passing the column
    # names here also works when no '.txt' files or no entities were found.
    df_ents = pd.DataFrame(named_entities, columns=['Filename', 'Entity', 'Label'])
    return df_ents

**Reading Files**

In [8]:
# define the path. This directory should have more than 1 file
path = './rawdata'

files_in_dir_info = process_dir(path)

**Analyzing Contents I**

The analysis below reports the following for each of the three '.txt' files in the 'rawdata' directory:
* Number of tokens
* Number of types
* Lexical diversity
* Top 10 most frequent words
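
As a rough illustration of how these four figures are derived, the sketch below recomputes them on a toy sentence. It is a simplified stand-in rather than the notebook's code: it assumes a regex tokenizer in place of `nltk.word_tokenize`, and the helper name `toy_text_info` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def toy_text_info(text, top_n=3):
    """Compute tokens, types, lexical diversity and top words for a string."""
    # Assumption: a simple regex splits words and punctuation marks;
    # nltk.word_tokenize may tokenize real text differently.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    # Lexical diversity (type-token ratio); guard against empty input
    lex_div = n_types / n_tokens if n_tokens else 0.0
    # Count only alphanumeric tokens so punctuation is excluded
    words = [t for t in tokens if t.isalnum()]
    return {
        "number of tokens": n_tokens,
        "types": n_types,
        "lexical diversity": lex_div,
        "top words": Counter(words).most_common(top_n),
    }

info = toy_text_info("The car is electric. The car is fast.")
print(info)
```

Punctuation marks count toward the token and type totals but are left out of the word ranking, which mirrors how `get_text_info` treats them.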

In [10]:
files_in_dir_info

{'EVCars.txt': {'number of tokens': 6038,
  'types': 1483,
  'lexical diversity': 0.2456111295130838,
  'Top 10 Freq Words': [('the', 278),
   ('to', 149),
   ('and', 123),
   ('of', 122),
   ('a', 105),
   ('in', 84),
   ('that', 73),
   ('is', 68),
   ('s', 67),
   ('electric', 62)]},
 'Ghostbusters.txt': {'number of tokens': 29380,
  'types': 4223,
  'lexical diversity': 0.14373723621511234,
  'Top 10 Freq Words': [('the', 1692),
   ('and', 675),
   ('a', 615),
   ('to', 532),
   ('of', 440),
   ('venkman', 417),
   ('i', 372),
   ('you', 367),
   ('it', 323),
   ('in', 278)]},
 'Middlemarch.txt': {'number of tokens': 373877,
  'types': 19342,
  'lexical diversity': 0.051733591528764805,
  'Top 10 Freq Words': [('the', 12679),
   ('to', 10158),
   ('of', 8980),
   ('and', 8497),
   ('a', 7257),
   ('in', 5287),
   ('that', 5056),
   ('he', 4943),
   ('was', 4609),
   ('i', 4589)]}}

In [12]:
df = pd.DataFrame.from_dict(files_in_dir_info, orient='index')
df

| | number of tokens | types | lexical diversity | Top 10 Freq Words |
|:--|--:|--:|--:|:--|
| EVCars.txt | 6038 | 1483 | 0.245611 | [(the, 278), (to, 149), (and, 123), (of, 122),... |
| Ghostbusters.txt | 29380 | 4223 | 0.143737 | [(the, 1692), (and, 675), (a, 615), (to, 532),... |
| Middlemarch.txt | 373877 | 19342 | 0.051734 | [(the, 12679), (to, 10158), (of, 8980), (and, ... |


**Analyzing Content II**

Below is the function call that builds the data frame of named entities and their labels, if present, for the three files.

In [14]:
# Call the function to build the data frame of named entities and their labels
named_entity_df = listentity(path)

In [16]:
# Check whether all three files are present in the data frame
np.unique(named_entity_df.Filename, return_counts=True)

(array(['EVCars.txt', 'Ghostbusters.txt', 'Middlemarch.txt'], dtype=object),
 array([  463,  1358, 12506], dtype=int64))

In [18]:
print(named_entity_df)

              Filename              Entity     Label
0           EVCars.txt            Brussels       GPE
1           EVCars.txt             Germany       GPE
2           EVCars.txt    Volkswagen Group       ORG
3           EVCars.txt          Thirty-six  CARDINAL
4           EVCars.txt               dozen  CARDINAL
...                ...                 ...       ...
14322  Middlemarch.txt              eBooks       ORG
14323  Middlemarch.txt   Project Gutenberg       ORG
14324  Middlemarch.txt  Archive Foundation       ORG
14325  Middlemarch.txt              eBooks       ORG
14326  Middlemarch.txt              eBooks       ORG

[14327 rows x 3 columns]
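
A flat listing like the one above is hard to compare across files. As a minimal sketch, the rows can be tallied by file and label using only the standard library. The sample rows below are copied from the printed output, and `label_counts_by_file` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def label_counts_by_file(rows):
    """Tally entity labels per file from (filename, entity, label) tuples."""
    counts = defaultdict(Counter)
    for filename, _entity, label in rows:
        counts[filename][label] += 1
    return counts

# Sample rows taken from the printed data frame (not the full 14,327 rows)
sample_rows = [
    ("EVCars.txt", "Brussels", "GPE"),
    ("EVCars.txt", "Germany", "GPE"),
    ("EVCars.txt", "Volkswagen Group", "ORG"),
    ("EVCars.txt", "Thirty-six", "CARDINAL"),
    ("Middlemarch.txt", "eBooks", "ORG"),
    ("Middlemarch.txt", "Project Gutenberg", "ORG"),
]

counts = label_counts_by_file(sample_rows)
for filename, labels in counts.items():
    print(filename, labels.most_common())
```

On the full data frame, `named_entity_df.groupby(['Filename', 'Label']).size()` should produce an equivalent tally.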


**Short Reflection**

a) What do the most frequent words tell you about each of the 3 genres? 
-> The most frequent words are largely the same across the three genres: function words such as 'the', 'to', 'and', 'of', and 'a' dominate every list. The most notable differences are 'venkman' in the Ghostbusters script and 'electric' in the EV Cars newspaper article. We feel that these content words point to the differences in content, audience, and mode of delivery among the three genres.
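
Because function words crowd the top-10 lists, one way to make the content words stand out is to drop stop words before ranking. The sketch below illustrates the idea and is not something the notebook does: the stop-word set is a small hand-picked list chosen for this example (NLTK's `stopwords` corpus would be a fuller option), `content_words` is a hypothetical helper, and the counts are the top-10 values reported in the output above.

```python
# Hypothetical, hand-picked stop-word list for illustration only
STOP_WORDS = {
    "the", "to", "and", "of", "a", "in", "that", "is",
    "i", "you", "it", "he", "was", "s",
}

def content_words(freq_pairs, stop_words=STOP_WORDS):
    """Keep only the (word, count) pairs whose word is not a stop word."""
    return [(word, count) for word, count in freq_pairs if word not in stop_words]

# Top 10 words per file, copied from the notebook output
ghostbusters_top10 = [
    ("the", 1692), ("and", 675), ("a", 615), ("to", 532), ("of", 440),
    ("venkman", 417), ("i", 372), ("you", 367), ("it", 323), ("in", 278),
]
ev_cars_top10 = [
    ("the", 278), ("to", 149), ("and", 123), ("of", 122), ("a", 105),
    ("in", 84), ("that", 73), ("is", 68), ("s", 67), ("electric", 62),
]

ghostbusters_content = content_words(ghostbusters_top10)
ev_cars_content = content_words(ev_cars_top10)
print(ghostbusters_content)
print(ev_cars_content)
```

In practice the filter would be applied to the full token list before taking the top 10, so that each list still contains ten content words.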

b) What do the named entities tell you about each of the 3 genres?
-> The named entities distinguish the three genres more clearly than the most frequent words do. From the entities and their labels, we can tell, for example, that EVCars.txt is a newspaper article: it contains entities such as Volkswagen Group (labeled ORG), which identifies a vehicle organization. Similarly, for Middlemarch.txt, the entity eBooks (labeled ORG) suggests that the text is a book, since it points to an electronic-books organization.

c) Is the named entity output correct, in your opinion? If not, why not?
-> Yes, we would say that the named entity output is correct, because it shows the different entities found in each of the three genres and labels them correctly. For example, the EV Cars newspaper article contains the entities Brussels (GPE), Volkswagen Group (ORG), and dozen (CARDINAL), and each is labeled correctly. This matters because correct entities and labels are what let us see the differences between the three genres.

d) How do you interpret the lexical diversity?
-> The EV Cars newspaper article has a lexical diversity of 0.245611, the highest of the three genres. This suggests a larger share of unique words, which is as expected: a newspaper article about electric vehicles uses more technical terms and a richer vocabulary.
-> The Ghostbusters movie script has a lexical diversity of 0.143737, lower than the EV Cars newspaper article but higher than the Middlemarch book. This moderate value suggests a dialogue-heavy text that repeats function words, as expected of movie scripts in general.
-> The Middlemarch book has a lexical diversity of 0.051734, the lowest of the three genres. This suggests a very long story with heavy repetition of common words and phrases, which is also as expected of books in general.
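
The three ratios above come from texts of very different lengths (6,038 to 373,877 tokens), and a type-token ratio generally falls as a text grows, so part of the gap may reflect length rather than vocabulary. One length-controlled alternative is the mean type-token ratio over fixed-size chunks, sketched below. This is not the measure used in the notebook; `mean_chunk_ttr`, the chunk size, and the toy token lists are assumptions for illustration.

```python
def mean_chunk_ttr(tokens, chunk_size=1000):
    """Average type-token ratio over consecutive full chunks of chunk_size tokens."""
    ratios = []
    # Step through the tokens one full chunk at a time; a trailing
    # partial chunk is skipped so every ratio uses the same length.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    # A text shorter than one chunk has no comparable ratio
    return sum(ratios) / len(ratios) if ratios else None

short_text = ["a", "b", "c", "d"]
long_text = ["a", "b", "c", "d"] * 5  # same vocabulary, five times longer

plain_short = len(set(short_text)) / len(short_text)
plain_long = len(set(long_text)) / len(long_text)
chunked_short = mean_chunk_ttr(short_text, chunk_size=4)
chunked_long = mean_chunk_ttr(long_text, chunk_size=4)

print(plain_short, plain_long)      # plain ratio drops for the longer text
print(chunked_short, chunked_long)  # chunked ratio is the same for both
```

The two toy texts use an identical vocabulary, yet the plain ratio is five times lower for the longer one. The chunked version gives both the same score, which is the property that makes it more suitable for comparing a short article with a long novel.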