# Parsing Files

In this notebook we show how to parse the contents of files. To demonstrate parsing lines of text, the four lines below are collected into a list. We then want to find the most frequently used word while excluding certain words. We also want to ignore case: "The" is the same as "the".

The next code cell sets up the data for the problem we are solving: the strings we need to go through and the list of words to exclude.

In [None]:
#these are our lines
line_a = "The quick brown fox jumps over the lazy dog"
line_b = "Several fabulous dixieland jazz groups played with quick tempo"
line_c = "The wizard quickly jinxed the gnomes before they vaporized"
line_d = "I really like a brown dog, fabulous project groups, and quick foxes"

#these are the words to exclude
words_to_exclude = ['the', 'i', 'groups', 'quick']

# s1: create our list
list_of_lines = []

# s2: add lines to list
# three ways we can do it

# approach 1
#list_of_lines.append(line_a)
#list_of_lines.append(line_b)
#list_of_lines.append(line_c)
#list_of_lines.append(line_d)

# approach 2
list_of_lines = [line_a, line_b, line_c, line_d]

# approach 3
#we could also use list_of_lines.extend([line_a, ...])  (extend takes a single iterable)

# print list
print(list_of_lines)
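
As a small sketch of the three approaches side by side (the names `by_append`, `by_literal`, `by_extend`, and `chars` are made up for this illustration, and only two of the lines are used to keep it short):

```python
line_a = "The quick brown fox jumps over the lazy dog"
line_b = "Several fabulous dixieland jazz groups played with quick tempo"

# approach 1: append adds one item per call
by_append = []
by_append.append(line_a)
by_append.append(line_b)

# approach 2: a list literal builds the list in one step
by_literal = [line_a, line_b]

# approach 3: extend takes ONE iterable and adds each of its items
by_extend = []
by_extend.extend([line_a, line_b])

# all three lists hold the same two strings
print(by_append == by_literal == by_extend)

# pitfall: a string is itself iterable, so extend adds its characters
chars = []
chars.extend("dog")
print(chars)
```

Passing a bare string to `extend` is an easy mistake to make, which is why the lines are wrapped in a list in approach 3.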

So now that we have our list of lines, we want to divide each string or line into a list of words.  We will store this in a different list to preserve the original.

In [None]:
#will store all the data
list_data = []
for line in list_of_lines:
    #split on " "
    split_line = line.split(' ')
    print(split_line)
    

Ok. We have some uppercase letters to worry about. We also have some commas (see the words "dog" and "groups" in line 4).

Let's take care of the commas first.

In [None]:
for line in list_of_lines:
    #lets remove the comma first
    line = line.replace(',','')   # replace comma with empty string
    #split on " "
    split_line = line.split(' ')  # split the line at each space
    print(split_line)

We can see that in the last line, "dog" and "groups" no longer have commas.
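
Replacing commas works for this data, but real text may carry other punctuation too. One possible generalization, shown here only as a sketch and not as the approach used in the rest of this notebook, is to strip punctuation from both ends of each word with `str.strip` and the standard library's `string.punctuation`:

```python
import string

line_d = "I really like a brown dog, fabulous project groups, and quick foxes"

# strip any leading/trailing punctuation characters from each word
words = [word.strip(string.punctuation) for word in line_d.split(' ')]
print(words)
```

Note that `strip` only touches the ends of a word, so punctuation inside a word (such as an apostrophe in "don't") is left alone.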

Let's take care of the uppercase issue.

In [None]:
for line in list_of_lines:
    #lets remove the comma first
    line = line.replace(',','')
    #split on " "
    split_line = line.split(' ')
    #work on lowercase 
    split_line_lowercase = []
    for word in split_line:
        #make word lowercase
        word_lowercase = word.lower()
        #append it to the lowercase list
        split_line_lowercase.append(word_lowercase)
    print(split_line_lowercase)

So we step through each line making each word lowercase and adding it to its own list.  This seems like a lot of code but might be easier to follow.  

**To make it a little neater, we can do the same thing in one line of code with a `list comprehension`:**
```python
[word.lower() for word in line.split(' ')]
```
In the above code, `line.split(' ')` creates a list, `for word in line.split(' ')` traverses that list, `word.lower()` makes each word lowercase, and `[ ]` collects the results into a new list.

**We can also perform the task with `map`:**
```python
# in Python 3, map returns a lazy iterator, so wrap it in list() to get a list
list(map(lambda x: x.lower(), line.split(' ')))
```
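
To check that the two forms agree, here is a short sketch on a single sample line. In Python 3, `map` returns a lazy iterator rather than a list, so it is wrapped in `list()` before comparing (the variable names are made up for this illustration):

```python
line = "The quick brown fox"

# list comprehension: builds the list directly
with_comprehension = [word.lower() for word in line.split(' ')]

# map: produces an iterator, which list() turns into a list
with_map = list(map(lambda x: x.lower(), line.split(' ')))

print(with_comprehension)
print(with_map == with_comprehension)
```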

In [None]:
for line in list_of_lines:
    #lets remove the comma first
    line = line.replace(',','')
    #split on " " and make lower
    split_line = [word.lower() for word in line.split(' ')]
    print(split_line)

Now that we have no commas and only one case (lowercase) to worry about, let's put all this into `list_data`.

In [None]:
list_data = []
for line in list_of_lines:
    #lets remove the comma first
    line = line.replace(',','')
    #split on " " and make lower
    split_line = [word.lower() for word in line.split(' ')]
    list_data.append(split_line)
print(list_data)
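
If the same cleanup is needed in more than one place, the loop can be wrapped in a function. This is just one way to package it; `parse_lines` and `sample` are hypothetical names introduced for this sketch:

```python
def parse_lines(lines):
    """Return one list of lowercase, comma-free words per input line."""
    parsed = []
    for line in lines:
        # remove commas, then split on spaces and lowercase each word
        line = line.replace(',', '')
        parsed.append([word.lower() for word in line.split(' ')])
    return parsed

sample = ["The quick brown fox", "I really like a brown dog, and quick foxes"]
print(parse_lines(sample))
```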

Notice that I'm printing often and breaking things into steps. Checking the data at each step makes it easier to find where things went wrong if the result isn't what we want.


Now that we have our list of lists the way we want, we can begin counting the words. We will be using a dictionary of (key, value) pairs. The key will be the word, and the value will be the frequency of that word (an integer).

**Just as a list is indexed by position (an integer), a dictionary is indexed by its key (the word, in our case).**
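
A quick sketch of the difference, using a few hand-written sample values rather than the real counts:

```python
# a list is looked up by integer position
words = ['brown', 'dog', 'fabulous']
print(words[1])

# a dictionary is looked up by key (here, the word itself)
word_frequency = {'brown': 2, 'dog': 2, 'fox': 1}
print(word_frequency['dog'])

# looking up a missing key raises KeyError, so check with `in` first
print('wizard' in word_frequency)
```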

In [None]:
#create our dictionary
word_frequency = {}
#loop through the list_data (each line in our 'file')
for line in list_data:
    print(line)

We can see each line prints correctly. Now let's focus on printing each word in a line.

In [None]:
#loop through the list_data (each line in our 'file')
for line in list_data:
    #loop through each word in the line(list of words)
    for word in line:
        print(word)
    #this will put a blank line in between lines
    print(' ')

Now that we are looking at each word individually let's think about counting. Sometimes you can literally write code by thinking about it.  

**Algorithm:**
If the word is in the dictionary, increment it. Otherwise add it and make it 1.  

```python
if word in word_frequency:
    word_frequency[word] += 1
else:
    word_frequency[word] = 1
```
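
The same algorithm can also be written without the `if`/`else` by using the dictionary's `get` method, which returns a default value when the key is missing. This is an alternative sketch on a small hand-written word list, not a replacement for the version above:

```python
words = ['the', 'quick', 'brown', 'fox', 'the']

word_frequency = {}
for word in words:
    # get returns the current count, or 0 if the word is not a key yet
    word_frequency[word] = word_frequency.get(word, 0) + 1

print(word_frequency)
```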

In [None]:
word_frequency = {}  # note that we declare this container (i.e., the dictionary) outside the for loop

#loop through the list_data (each line in our 'file')
for line in list_data:
    #loop through each word in the line(list of words)
    for word in line:
        if word in word_frequency:
            word_frequency[word] += 1
        else:
            word_frequency[word] = 1
print(word_frequency)

Ok, so we see that 'the' has been used 4 times, but it is on our list of words to exclude. We can handle the excluded words either after the dictionary is built or while building it.

In [None]:
word_frequency = {}

#loop through the list_data (each line in our 'file')
for line in list_data:
    #loop through each word in the line(list of words)
    for word in line:
        #check to see if word is not in the list.
        if word not in words_to_exclude:
            if word in word_frequency:
                word_frequency[word] += 1
            else:
                word_frequency[word] = 1
        #else we just continue
print(word_frequency)
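
The cell above excludes words while building the dictionary. For comparison, here is a sketch of filtering after the fact with a dictionary comprehension. The counts below are a small hand-written sample, and `filtered` is a name made up for this illustration:

```python
words_to_exclude = ['the', 'i', 'groups', 'quick']

# sample counts, as if no words had been excluded while counting
word_frequency = {'the': 4, 'quick': 3, 'brown': 2, 'fox': 1}

# keep only the (word, count) pairs whose word is not excluded
filtered = {
    word: count
    for word, count in word_frequency.items()
    if word not in words_to_exclude
}
print(filtered)
```

Filtering afterward keeps the full counts around in case they are needed later, at the cost of building a second dictionary.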

Most words have a frequency of 1, but a few have a frequency of 2. Let's say we want to print the words with the maximum frequency: 'brown', 'dog', and 'fabulous', each with a frequency of 2.

We can either go back through the finished dictionary to find the maximum, or keep track of the maximum frequency as we add words to the dictionary.

In [None]:
word_frequency = {}
max_frequency = 0

#loop through the list_data (each line in our 'file')
for line in list_data:
    #loop through each word in the line(list of words)
    for word in line:
        #check to see if word is not in the list.
        if word not in words_to_exclude:
            if word in word_frequency:
                word_frequency[word] += 1
            else:
                word_frequency[word] = 1
            #make sure its a word we have added to the dictionary
            if max_frequency < word_frequency[word]:
                max_frequency = word_frequency[word]
        #else we just continue to the next word
print(f"==> (word,freq): {word_frequency}")
print(f"==> Max freq = {max_frequency}")
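
The cell above tracks the maximum while counting. Going back through the finished dictionary instead can be sketched with the built-in `max` over the dictionary's values (sample counts are hand-written for this illustration):

```python
word_frequency = {'brown': 2, 'fox': 1, 'dog': 2, 'lazy': 1}

# max over the counts; default=0 avoids an error on an empty dictionary
max_frequency = max(word_frequency.values(), default=0)
print(max_frequency)
```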

We now have the max frequency of 2, and a dictionary of words.
Let's make a new list of `most_frequent_words` and add all of the words with that frequency to the list.

In [None]:
most_frequent_words = []
for word in word_frequency:
    if max_frequency == word_frequency[word]:
        most_frequent_words.append(word)

print(f"The most frequent words = {most_frequent_words}")

# we can sort the words with `sort` method
most_frequent_words.sort()
print(f"The most frequent words (sorted) = {most_frequent_words}")
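
For reference, the standard library's `collections.Counter` can do the counting step for us. The sketch below swaps it in for the hand-written dictionary logic and is not the method developed in this notebook. It uses a set for the excluded words (an assumption made here because membership checks on a set are fast) and the hypothetical name `counts`:

```python
from collections import Counter

lines = [
    "The quick brown fox jumps over the lazy dog",
    "Several fabulous dixieland jazz groups played with quick tempo",
    "The wizard quickly jinxed the gnomes before they vaporized",
    "I really like a brown dog, fabulous project groups, and quick foxes",
]
words_to_exclude = {'the', 'i', 'groups', 'quick'}

# remove commas, lowercase, split, and count every word that is not excluded
counts = Counter(
    word
    for line in lines
    for word in line.replace(',', '').lower().split(' ')
    if word not in words_to_exclude
)

max_frequency = max(counts.values())
most_frequent_words = sorted(
    word for word, count in counts.items() if count == max_frequency
)
print(most_frequent_words)
```

Writing the loops by hand first, as done above, makes it easier to see what `Counter` is doing behind the scenes.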

In this example we split lines into lowercase words, counted them while excluding some common words, and listed the most frequent ones.
Keep this example in mind for the upcoming file processing practice.

# Save your notebook, then `File > Close and Halt`