## Recitation 3

Welcome back! This recitation will cover the following topics:
- Sets
- Dictionaries
- File I/O
- Numpy

### Task 1

Bee Movie is a 2007 American computer-animated comedy film produced by DreamWorks Animation and starring Jerry Seinfeld. The film centers on Barry B. Benson (Seinfeld), a honey bee who tries to sue the human race for exploiting bees after learning from his florist friend Vanessa Bloome (Zellweger) that humans sell and consume honey. [Source](https://en.wikipedia.org/wiki/Bee_Movie)

Your bee studies class (BEE 210) has tasked you with homework telling you to break down the script of the movie. They have asked you to provide the following things:
- A text file listing every word that appears in the film along with its frequency (return the dictionary as well). **NOTE:** Try to strip as much punctuation as you can, including question marks, periods, and colons.
- The top ten words that appear in the film, along with their frequencies.
- The number of times each of the following appears in the script: Bee Gandhi, Ralph Lauren, Larry King, and bejesus. If one of them does not appear in the script, catch the error and output a nice sentence telling the user that it did not appear.
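
One possible way to strip punctuation in a single pass is `str.translate` with a table built from `string.punctuation`. The sketch below is only a hint, not the required approach; the helper name `clean_line` is made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Note: this also removes apostrophes and hyphens, so "don't" becomes "dont".
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def clean_line(line: str) -> list:
    """Lowercase a line, remove punctuation, and split it into words."""
    return line.lower().translate(PUNCTUATION_TABLE).split()

print(clean_line("Barry: Yellow, black. Yellow, black?"))
# ['barry', 'yellow', 'black', 'yellow', 'black']
```

Chaining `.replace()` calls works just as well for a handful of characters; the table approach mainly saves typing when many characters need to go.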

In [None]:
# code to download the bee movie text
!wget https://courses.cs.washington.edu/courses/cse163/21su/files/lectures/L04/bee-movie.txt

In [None]:
# Function that takes in an output file name and a script file name, writes the
# frequency of each word to the output file, and returns the dictionary it creates
def find_frequencies(file_name: str = "bee_frequencies.txt", script_file: str = "bee-movie.txt") -> dict:
  pass

#### Answer

```python

def find_frequencies(file_name: str = "bee_frequencies.txt", script_file: str = "bee-movie.txt") -> dict:
  freq_dict = {}

  with open(script_file) as f:
    for line in f:
      line = line.strip()
      line = line.replace(":", "").replace(".", "").replace("?", "").replace(")", "").replace("(", "").lower()
      words = line.split()

      for word in words:
        # get() returns 0 for a word we have not seen yet
        freq_dict[word] = freq_dict.get(word, 0) + 1

  ## Option 1
  with open(file_name, "w+") as f:
    for word in freq_dict:
      f.write(word + ":" + str(freq_dict[word]))
      f.write('\n')

  ## Option 2
  freq_list = list(freq_dict.items())
  strip_quote = "'"
  freq_list = [f"{freq[0].strip(strip_quote)}:{str(freq[1])} \n" for freq in freq_list]
  with open(file_name, "w+") as f:
    f.writelines(freq_list)
  
  return freq_dict

freq_dict = find_frequencies()
```

**Q:** What is the difference between the two options?

In [None]:
# Function that takes in a text file of key/frequencies and 
# returns the top ten words that appear as a list
def find_top_ten(file_name: str = "bee_frequencies.txt") -> list:
  pass

#### Answer

```python

def find_top_ten(file_name: str = "bee_frequencies.txt") -> list:
    # Initialize an empty dictionary to store word frequencies
    word_frequencies = {}

    try:
        # Open the file for reading
        with open(file_name, 'r') as file:
            # Read each line in the file
            for line in file:
                # Split the line into key and frequency using a colon as the delimiter
                key, frequency = line.strip().split(':')
                # Store the word and its frequency in the dictionary
                word_frequencies[key] = int(frequency)

    except FileNotFoundError:
        print(f"File '{file_name}' not found.")
        return []

    # Sort the dictionary by values (frequencies) in descending order
    sorted_words = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)

    # Extract the top ten words from the sorted list
    top_ten_words = [word for word, frequency in sorted_words[:10]]

    return top_ten_words

top_words = find_top_ten()
print("Top Ten Words:", top_words)
```

**Q:** How can we keep common words such as "the" and "of" from appearing? HINT: These are called stop words in NLP.
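
As a rough sketch of the idea, one option is to filter the frequency dictionary against a set of stop words before sorting. The stop-word set and the function name `top_n_without_stop_words` below are assumptions made for illustration; NLP libraries ship much longer lists.

```python
# A tiny, hand-picked stop-word set for illustration only (an assumption)
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "you", "i"}

def top_n_without_stop_words(freq_dict: dict, n: int = 10) -> list:
    """Return the n most frequent words that are not stop words."""
    # Keep only the words that are not in the stop-word set
    filtered = {word: count for word, count in freq_dict.items() if word not in STOP_WORDS}
    # Sort the remaining (word, count) pairs by count, highest first
    ranked = sorted(filtered.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Made-up counts, not taken from the actual script
sample = {"the": 50, "bee": 12, "of": 30, "honey": 9, "you": 40}
print(top_n_without_stop_words(sample, n=2))  # ['bee', 'honey']
```

Using a set for the stop words keeps each membership check fast, which ties back to the sets topic of this recitation.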

In [None]:
# Function that takes in a list of words and a frequency 
# dictionary and outputs the frequency of each word
def find_words(to_find: list, freq_dict: dict):
  pass

#### Answer

```python
def find_words(to_find, freq_dict):
    
    for word in to_find:
        # Use the get() method to retrieve the frequency from the dictionary
        frequency = freq_dict.get(word, 0)
        if frequency > 0:
            print(f'{word} appeared {frequency} times in the script')
        else:
            print(f'The word: {word} did not appear in the script')

words_to_find = ["gandhi", "lauren", "bejesus", "larry"]
find_words(words_to_find, freq_dict)
```
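
The task statement asks you to catch the error, while the answer above sidesteps it with `get()`. A minimal sketch of the error-catching version might look like this; the helper name `describe_word` and the sample counts are made up for illustration.

```python
def describe_word(word: str, freq_dict: dict) -> str:
    """Return a sentence describing how often a word appeared."""
    try:
        # Indexing a dictionary with a missing key raises KeyError
        frequency = freq_dict[word]
    except KeyError:
        return f"Sorry, '{word}' did not appear in the script."
    return f"'{word}' appeared {frequency} times in the script."

# Made-up counts, not taken from the actual script
sample_freq = {"larry": 3, "bejesus": 1}
for word in ["larry", "gandhi"]:
    print(describe_word(word, sample_freq))
```

Both styles are common in Python: `get()` is shorter, while `try`/`except` makes the missing-key case explicit.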

### Task 2
**Open Ended**

You and your friend have created a new startup to motivate the Rutgers community, a noble cause! You want to broadcast a motivational quote every day to remind the community of how great Rutgers is. Read in the `quotes.txt` file and choose a quote at random to broadcast.

<img src="https://media.tenor.com/UF6EuB04PRQAAAAd/rutgers-greg-schiano.gif" alt="meme" width="200"/>

In [None]:
!wget https://gist.githubusercontent.com/robatron/a66acc0eed3835119817/raw/0e216f8b6036b82de5fdd93526e1d496d8e1b412/quotes.txt

In [None]:
# Function that takes in a text file name and prints a 
# random line from said file
# NOTE: The quotes from this txt file do not represent Rutgers' or your TA's views in ANY way
def print_random_line(file_name: str = "quotes.txt"):
  pass

#### Answer
```python
import random

def print_random_line(file_name="quotes.txt"):
    try:
        # Open the file for reading
        with open(file_name, 'r') as file:
            # Read all lines from the file and store them in a list
            lines = file.readlines()
            
            if not lines:
                print(f"The file '{file_name}' is empty.")
            else:
                # Select a random line from the list
                random_line = random.choice(lines)
                # Print the random line
                print(random_line.strip())

    except FileNotFoundError:
        print(f"File '{file_name}' not found.")
```

### Task 3

Determine if a $9 \times 9$ Sudoku board is valid. Only the filled cells need to be validated according to the following rules:

Each row must contain the digits 1-9 without repetition.

Each column must contain the digits 1-9 without repetition.

Each of the nine 3 x 3 sub-boxes of the grid must contain the digits 1-9 without repetition.

Note:
- A Sudoku board (partially filled) could be valid but is not necessarily solvable.
- Only the filled cells need to be validated according to the mentioned rules.

For example, this board below:

![example](https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Sudoku-by-L2G-20050714.svg/250px-Sudoku-by-L2G-20050714.svg.png)

is input as:

```text
board = 
[["5","3",".",".","7",".",".",".","."]
,["6",".",".","1","9","5",".",".","."]
,[".","9","8",".",".",".",".","6","."]
,["8",".",".",".","6",".",".",".","3"]
,["4",".",".","8",".","3",".",".","1"]
,["7",".",".",".","2",".",".",".","6"]
,[".","6",".",".",".",".","2","8","."]
,[".",".",".","4","1","9",".",".","5"]
,[".",".",".",".","8",".",".","7","9"]]
```

and is a valid board. Hint: look at the slicing section from the numpy lecture.

You're better off solving this question on the website I got it from [source](https://leetcode.com/problems/valid-sudoku/)
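
To make the slicing hint concrete, here is a small sketch on a grid of plain integers rather than a Sudoku board, so it shows the indexing without giving away the solution. The variable names are arbitrary, and it assumes numpy is already installed.

```python
import numpy as np

# A 9 x 9 grid holding 0..80, so each value reveals its position
grid = np.arange(81).reshape(9, 9)

row = grid[4, :]      # fifth row (index 4)
col = grid[:, 0]      # first column
box = grid[3:6, 6:9]  # 3 x 3 sub-box covering rows 3-5 and columns 6-8

print(box)
# [[33 34 35]
#  [42 43 44]
#  [51 52 53]]

# flatten() turns the 3 x 3 box into a 1-D array of nine values
print(box.flatten())
```

The same `start:stop` pattern on both axes selects any of the nine sub-boxes once the starting row and column are multiples of 3.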

In [None]:
!pip3 install numpy

#### Answer

```python
import numpy as np

def is_valid_sudoku(board: list):
    board = np.array(board)
    
    # Check rows
    for row in board:
        if not is_valid_row(row):
            return False
    
    # Check columns
    for col in board.T:
        if not is_valid_row(col):
            return False
    
    # Check subgrids (3x3)
    for i in range(0, 9, 3):
        for j in range(0, 9, 3):
            subgrid = board[i:i+3, j:j+3]
            if not is_valid_row(subgrid.flatten()):
                return False
    
    return True

def is_valid_row(nums):
    seen = set()
    for num in nums:
        if num != '.' and num in seen:
            return False
        seen.add(num)
    return True

# Example usage:
board = [
    ["5","3",".",".","7",".",".",".","."],
    ["6",".",".","1","9","5",".",".","."],
    [".","9","8",".",".",".",".","6","."],
    ["8",".",".",".","6",".",".",".","3"],
    ["4",".",".","8",".","3",".",".","1"],
    ["7",".",".",".","2",".",".",".","6"],
    [".","6",".",".",".",".","2","8","."],
    [".",".",".","4","1","9",".",".","5"],
    [".",".",".",".","8",".",".","7","9"]
]

print(is_valid_sudoku(board))  # Should print True
```

### Task 4

Numpy is often used for tasks involving matrix manipulation, such as linear regression. Read the house prices CSV file into pandas and drop the columns that don't contain numerical data (we can work with these at a later date, once we get more advanced with pandas). Then store the labels (the price of each home) in a numpy array called `Y` and the features (all the other information we know about the properties) in a separate array called `X`. Finally, use the following formula to find the weights (importances) to give to the features. Once we have those, we can make predictions.

$W = (X^{T} X)^{-1} X^{T} Y$
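
For context, this formula (often called the normal equation) falls out of minimizing the squared error between the predictions $XW$ and the labels $Y$. A short sketch of the derivation, assuming $X^{T} X$ is invertible:

$$
\begin{aligned}
L(W) &= \lVert XW - Y \rVert^{2} = (XW - Y)^{T}(XW - Y) \\
\nabla_{W} L &= 2 X^{T}(XW - Y) = 0 \\
X^{T} X W &= X^{T} Y \\
W &= (X^{T} X)^{-1} X^{T} Y
\end{aligned}
$$

Setting the gradient to zero gives the weights with the smallest squared error on the data.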

NOTE: No need to write a function for this; just write the logic in a code block.

<img src="https://media.tenor.com/miz5HbQcXlYAAAAC/spongebob.gif" alt="meme" width="200"/>


In [None]:
!wget https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv
!pip3 install pandas

#### Answer

```python
import pandas as pd
import numpy as np

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv('house-prices.csv').drop(['Brick', 'Neighborhood', 'Home'], axis=1)

# Extract the labels (price of the homes) into a numpy array Y
Y = df['Price'].values

# Extract the features (all other numeric columns) into a different numpy array X
X = df.drop(columns=['Price']).values

# Perform linear regression using the provided formula
X_with_bias = np.column_stack((X, np.ones(len(X)))) # need to add a constant value (bias term) this is the "b" in y = mx + b
X_transpose = np.transpose(X_with_bias)
X_transpose_dot_X = np.dot(X_transpose, X_with_bias)
X_transpose_dot_Y = np.dot(X_transpose, Y)
weights = np.dot(np.linalg.inv(X_transpose_dot_X), X_transpose_dot_Y)

# The 'weights' variable now contains the importance (weights) to give to the features
df.head()
for attr, weight in zip(list(df.drop(columns=['Price']).columns), weights):
    print(f'{attr} has {weight} importance')
```

The following check predicts the price of one home taken from the dataset (actual price 114300) and prints the prediction error:

```python
# weights[4] is the bias term (the column of ones appended last to X_with_bias)
price = weights[0]*1790 + weights[1]*2 + weights[2]*2 + weights[3]*2 + weights[4]
error = price - 114300

print(error)
```
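
As an optional aside, the same weights can be found without forming an explicit inverse by using `np.linalg.lstsq`, which tends to behave better numerically. The sketch below uses synthetic data made up for illustration (not the house prices file) so that the expected weights are known in advance.

```python
import numpy as np

# Synthetic data (made up for illustration): y = 3*x1 - 2*x2 + 5, with no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = 3 * X[:, 0] - 2 * X[:, 1] + 5

# Append a column of ones so the last weight acts as the bias term
X_with_bias = np.column_stack((X, np.ones(len(X))))

# Normal equation: W = (X^T X)^{-1} X^T Y
w_normal = np.linalg.inv(X_with_bias.T @ X_with_bias) @ X_with_bias.T @ Y

# Least-squares solver: same solution without computing the inverse
w_lstsq, *_ = np.linalg.lstsq(X_with_bias, Y, rcond=None)

# Predictions for every row at once; the bias comes from the ones column
predictions = X_with_bias @ w_lstsq

print(np.round(w_normal, 6))
print(np.round(w_lstsq, 6))
```

Both approaches should recover weights close to 3, -2, and 5 here, and multiplying `X_with_bias` by the weights predicts every row in one step instead of writing out each term by hand.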