# Basic scripting with Python

Using the corpus called 100-english-novels found on the cds-language GitHub repo, write a Python programme which does the following:

- Calculate the total word count for each novel
- Calculate the total number of unique words for each novel
- Save the result as a single file consisting of three columns: filename, total_words, unique_words
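
As a small illustration of the two counts, the sketch below applies them to a single made-up sentence rather than a novel. It assumes that a "word" is any whitespace-separated token and that the comparison is case-sensitive; the assignment text does not fix either choice, and `sample_text` is a hypothetical example.

```python
# Hypothetical sample text; not taken from the corpus
sample_text = "the cat saw the dog and the dog saw the cat"

# Assumption: words are whitespace-separated tokens
words = sample_text.split()

total_words = len(words)         # every token counts
unique_words = len(set(words))   # a set keeps one copy of each distinct token

print(total_words, unique_words)
```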


# General instructions

- For this exercise, you can upload either a standalone script OR a Jupyter Notebook
- Save your script as word_counts.py OR word_counts.ipynb
- You can either upload the script/notebook here or push to GitHub and include a link - or both!
- Your code should be clearly documented in a way that allows others to easily follow the structure of your script.
- Similarly, remember to use descriptive variable names! A name like word_count is more readable than wcnt.

# Purpose

This assignment is designed to test that you have an understanding of:

- how to structure, document, and share a Python script;
- how to effectively make use of native Python data structures, functions, and flow control;
- how to load, save, and process text files.

In [1]:
# Importing all the necessary modules
import os
import re
import pandas as pd
from pathlib import Path

In [2]:
# Building the path to the text corpus with os.path.join, so the code works on operating systems that use either backslashes or forward slashes
filepath = os.path.join("..", "data", "100_english_novels", "corpus")

# Creating empty lists, later to be appended to
filenames = []
n_words = []
n_unique_words = []
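
The path built above can also be written with `pathlib`, which the notebook already imports. The sketch below is an optional alternative, not the original approach; the variable names `joined` and `corpus_dir` are introduced here for illustration only.

```python
import os
from pathlib import Path

# Path built with os.path.join, as in the cell above
joined = os.path.join("..", "data", "100_english_novels", "corpus")

# The same location built with pathlib's / operator
corpus_dir = Path("..") / "data" / "100_english_novels" / "corpus"

# Both spellings point to the same location on the current platform
print(Path(joined) == corpus_dir)
```

Either form avoids hard-coding a path separator; the `pathlib` version has the advantage that the result can be passed straight to `.glob()`.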

In [3]:
# Retrieving filename, number of words (n_words) and number of unique words (n_unique_words)
for novel_path in Path(filepath).glob("*.txt"): # For each file in the filepath that ends with .txt, read the file into "text"
    with open(novel_path, "r", encoding="utf-8") as novel_file: # Separate names, so the file handle does not overwrite the path
        text = novel_file.read()

    textname = novel_path.name # Retrieve the name of the text (without the path); works with both / and \ separators
    text = re.sub(r'[^A-Za-z0-9\s]+', '', text) # Delete everything except ASCII letters, digits and whitespace (raw string avoids invalid escape sequences)
    text_words = text.split() # Split the text into a list of words
    unique_text_words = set(text_words) # Make a set of only the unique words (case-sensitive: "The" and "the" count as two words)

    # Append the filenames, the number of words, and the number of unique words to the empty lists.
    filenames.append(textname)
    n_words.append(len(text_words))
    n_unique_words.append(len(unique_text_words))
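
The loop above treats "The" and "the" as two different words, and `Path.glob` does not guarantee any particular file order. The sketch below is one possible variant, not part of the original solution: a hypothetical helper `count_words` lowercases the text before counting, and a hypothetical `count_corpus` sorts the paths so that the rows come out in the same order on every run. Whether case should be ignored is an assumption; the assignment does not say.

```python
import re
from pathlib import Path


def count_words(text):
    """Return (total_words, unique_words) for one text, ignoring case."""
    # Same cleaning step as in the loop above: keep ASCII letters, digits and whitespace
    cleaned = re.sub(r"[^A-Za-z0-9\s]+", "", text)
    # Assumption: lowercase so that "The" and "the" count as the same word
    words = cleaned.lower().split()
    return len(words), len(set(words))


def count_corpus(corpus_dir):
    """Return a list of (filename, total_words, unique_words) rows, sorted by path."""
    rows = []
    for novel_path in sorted(Path(corpus_dir).glob("*.txt")):
        text = novel_path.read_text(encoding="utf-8")
        rows.append((novel_path.name, *count_words(text)))
    return rows
```

Putting the counting into a function also makes it easy to check on a short string before running it over the whole corpus.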

In [4]:
# Create a dict for the information we have retrieved; the keys are the column names required by the assignment.
data = {'filename': filenames, 'total_words': n_words, 'unique_words': n_unique_words}

In [5]:
# Create a pandas dataframe from the dictionary.
# Each key becomes a column name, and each value (a list) becomes the values of that column
meta_data = pd.DataFrame(data)

In [6]:
# Create output path
outpath = os.path.join("..", "data", "100_english_novels", "meta_data.csv")

In [7]:
# Write the dataframe to outpath; index=False leaves out the row index so the file has exactly three columns
meta_data.to_csv(outpath, index=False)
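
For comparison, the same three-column file can be written without pandas. The sketch below uses the standard library's `csv` module; the helper name `save_counts`, the example rows and the output filename are hypothetical and only serve to illustrate the layout asked for in the assignment (filename, total_words, unique_words).

```python
import csv


def save_counts(rows, outpath):
    """Write (filename, total_words, unique_words) rows to a CSV file with a header."""
    # newline="" lets the csv module control line endings itself
    with open(outpath, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["filename", "total_words", "unique_words"])
        writer.writerows(rows)


# Made-up rows for illustration only; real values come from the corpus
example_rows = [("novel_a.txt", 1200, 450), ("novel_b.txt", 980, 390)]
save_counts(example_rows, "word_counts_example.csv")
```

Because the header is written explicitly, no extra index column can slip into the file.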