# Question 1

**Write a function**
- The function should
    - Take in a string
    - Split the string on spaces
    - Remove a few punctuation marks (.,?!)
    - Make the text lowercase
    - Return the most common word in the string
- Input a test string (‘I bought a sandwich with a side of chips!’) into the function and print the output

In [11]:
import Pkg; Pkg.add("DataStructures")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.9/Manifest.toml`


In [21]:
using DataStructures

function find_most_common_word(text::String)
    # Split the string on spaces
    words = split(text)
    
    # Remove punctuation marks and make each word lowercase
    words = map(w -> lowercase(replace(w, r"[,.!?]" => "")), words)
    
    # Count the frequency of each word
    word_counts = counter(words)
    
    # Find the word with the maximum frequency
    freq, most_common_word = findmax(word_counts)
    
    return most_common_word
end

# Test the function with a given string
test_string = "I bought a sandwich with a side of chips!"
println("The most common word is: ", find_most_common_word(test_string))

The most common word is: a
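For comparison, the same counting step can be sketched with only Base Julia, which makes explicit what `counter` does behind the scenes. The function name `most_common_word_base` is a hypothetical helper introduced here for illustration, not part of the original solution.

```julia
# A minimal sketch: count words with a plain Dict instead of DataStructures.counter.
function most_common_word_base(text::AbstractString)
    counts = Dict{String,Int}()
    for raw in split(text)                      # split on whitespace
        word = lowercase(replace(raw, r"[.,?!]" => ""))
        isempty(word) && continue               # skip tokens that were only punctuation
        counts[word] = get(counts, word, 0) + 1
    end
    isempty(counts) && return nothing           # no words at all
    # findmax on a Dict returns (largest value, key of that value)
    _, word = findmax(counts)
    return word
end

println(most_common_word_base("I bought a sandwich with a side of chips!"))
```

When several words tie for the highest count, `findmax` returns whichever it meets first, and a `Dict` has no guaranteed iteration order, so the winner of a tie should be treated as arbitrary.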


# Question 2

### Part 1
Read in the movie_reviews.csv file
- Print the number of rows in the data set

In [26]:
#Pkg.add("DataFrames")
#Pkg.add("CSV")
using DataFrames
using CSV

In [64]:
# Read in the movie_reviews.csv file
df = CSV.read("movie_reviews.csv", DataFrame)

# Print the number of rows in the data set
println("Number of Rows: $(size(df, 1))")

Number of Rows: 16638


### Part 2
Filter the data to only contain “popular” movies that were released in
theatres before 2010 (popular = movies with more audience reviews than
the average number of audience reviews of all movies before 2010)
- Print the number of rows in the output

In [72]:
#Pkg.add("Dates")
using Dates, Statistics

# Omit rows with missing dates
df = dropmissing(df, :in_theaters_date)

# Function to decide the correct century
function add_century(year::Int)
    # If the two-digit year is less than 19, assume the 2000s; otherwise, the 1900s
    return year < 19 ? 2000 + year : 1900 + year
end

# Parse the dates and adjust the years
df[!, :in_theater_year] = [add_century(year(d)) for d in Date.(df[!, :in_theaters_date], DateFormat("m/d/y"))]

# Filter the DataFrame to only contain rows where the movie was released before 2010
df_before_2k10 = filter(row -> row[:in_theater_year] < 2010, df)

# Print the number of rows in the data set before 2010
println("Number of Rows before 2010: $(nrow(df_before_2k10))")

# Calculate the mean audience count for movies before 2010
mean_audience_count = float(mean(skipmissing(df_before_2k10[!, :audience_count])))

# Filter the df to contain only popular movies, i.e., those with more audience reviews than the average
df_pop_2k10 = filter(row -> !ismissing(row[:audience_count]) && row[:audience_count] > mean_audience_count, df_before_2k10)

# Print the number of rows of popular movies before 2010
println("Number of popular movies Rows before 2010: $(nrow(df_pop_2k10))")

Number of Rows before 2010: 10135
Number of popular movies Rows before 2010: 1030
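The century adjustment is needed because parsing a two-digit year with the `"m/d/y"` format keeps the digits as written, so `09` becomes the year 9 rather than 2009. The sketch below repeats the pivot rule from `add_century` under the hypothetical name `pivot_year` and prints a few boundary values. It assumes dates shaped like `"1/15/09"`; the sample date is made up for illustration.

```julia
using Dates

# Same rule as add_century: two-digit years below 19 map to the 2000s.
pivot_year(yy::Int) = yy < 19 ? 2000 + yy : 1900 + yy

# A two-digit year is parsed literally, without any century inference.
raw = Date("1/15/09", DateFormat("m/d/y"))   # hypothetical sample date
println("Parsed year: ", year(raw), ", adjusted year: ", pivot_year(year(raw)))

# Values on both sides of the pivot
for yy in (0, 9, 18, 19, 85, 99)
    println(lpad(yy, 2, '0'), " -> ", pivot_year(yy))
end
```

With this rule a two-digit year of 19 maps to 1919, so if the data set contains releases from 2019 or later they would be counted as pre-2010 movies. Whether it does has not been checked here.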


### Part 3
Using the filtered data, display the percent of movies that fall under each type of rating (R, PG-13, etc.)
- Share an insight from the summary table

In [80]:
rating_counts = combine(groupby(df_pop_2k10, :rating, skipmissing=true), nrow => :rating_count)

# Calculate the total count of movies; skipmissing is used to ensure missing values don't affect the sum
total_movie_count = sum(skipmissing(rating_counts[!, :rating_count]))

# Calculate the percentage of each rating count
rating_counts[!, :rating_perc] = round.((rating_counts[!, :rating_count] ./ total_movie_count) .* 100, digits=2)

rating_counts

5×3 DataFrame
 Row │ rating   rating_count  rating_perc
     │ String7  Int64         Float64
─────┼────────────────────────────────────
   1 │ PG                200        19.42
   2 │ R                 336        32.62
   3 │ NR                  2         0.19
   4 │ G                  66         6.41
   5 │ PG-13             426        41.36


### Part 4

Engineer a new feature from any of the existing columns
- Create a summary table using the new feature
- Share an insight from the summary table

#### Feature 1

**Runtime Categories:** 

Movies are categorized into 
- 'Short' (less than 90 minutes)
- 'Medium' (90 to 120 minutes)
- 'Long' (more than 120 minutes). 

This categorization can help analyze trends across different movie durations.

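**Insight:**
- Most movies are of medium length: 10,384 of the 15,823 movies in the summary table (about 66%) run 90 to 120 minutes, which suggests this is the standard feature length.
- Short movies (about 20%) and long movies (about 13%) are less common, which may reflect more specific audience or artistic choices.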

In [84]:
# Define the categorize_runtime function that maps a runtime to a category
function categorize_runtime(runtime)
    if ismissing(runtime) || runtime < 0
        return "Unknown"  # Handling missing or invalid data
    elseif runtime < 90
        return "Short"
    elseif runtime <= 120
        return "Medium"
    else
        return "Long"
    end
end

# Apply the function to each row in the runtime_in_minutes column
# to create the runtime_category column
df.runtime_category = map(categorize_runtime, df.runtime_in_minutes)

# Generate the summary table by counting the number of occurrences of each runtime category
runtime_summary = combine(groupby(df, :runtime_category), nrow => :count)

# Display the summary table
println(runtime_summary)

4×2 DataFrame
 Row │ runtime_category  count
     │ String            Int64
─────┼─────────────────────────
   1 │ Short              3202
   2 │ Medium            10384
   3 │ Long               2112
   4 │ Unknown             125
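To put these counts in proportion, the short sketch below converts them to percentages of the total. The counts are copied from the table above; the helper name `category_shares` is hypothetical.

```julia
# Counts copied from the runtime summary table
runtime_counts = ["Short" => 3202, "Medium" => 10384, "Long" => 2112, "Unknown" => 125]

# Return each category's share of the total, in percent, rounded to one decimal
function category_shares(counts)
    total = sum(last, counts)
    return [category => round(100 * n / total, digits=1) for (category, n) in counts]
end

for (category, pct) in category_shares(runtime_counts)
    println(rpad(category, 8), pct, "%")
end
```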


#### Feature 2

**Genre Count:**
Counts the number of genres associated with each movie. This feature could provide insights into the complexity or audience appeal based on the variety of genres.

**Insights:**
- G-rated movies have the highest average number of genres (about 3.13), which may indicate that they target a broader audience, including families and children, with more varied themes.
- PG-13 and R-rated movies have nearly identical average genre counts (about 2.10 and 2.11), the lowest in the table, suggesting a similarly focused approach to their respective audiences.

In [88]:
function count_genres(genre_str)
    if ismissing(genre_str)
        return missing  # Return missing if the genre string is missing
    else
        return length(split(genre_str, ", "))  # Split on ", " and count the genres
    end
end

# Generate "Genre Count" feature, handling missing values
df[!, :genre_count] = map(count_genres, df[!, :genre])

# Summary Table for Genre Count (grouped by rating)
genre_count_summary = combine(groupby(df, :rating), 
                              :genre_count => x -> mean(skipmissing(x)))

# Display the summary table
println(genre_count_summary)


6×2 DataFrame
 Row │ rating   genre_count_function
     │ String7  Float64
─────┼───────────────────────────────
   1 │ PG                    2.51918
   2 │ R                     2.10727
   3 │ NR                    2.31221
   4 │ G                     3.12934
   5 │ PG-13                 2.0951
   6 │ NC17                  2.71429


# Question 3
Explore an external library
- Import a library that wasn’t covered in class
- Explain how it works
- Show an example of the library in action

For this question, I will explore the HTTP.jl package.

### Description of HTTP.jl:
HTTP.jl is a Julia library for handling HTTP requests and responses. It provides a range of functionalities to facilitate both the client-side and server-side of HTTP communications.

### How It Works
- On the client side, HTTP.jl lets users send HTTP requests to servers and handle the responses. It supports the standard request methods (GET, POST, PUT, DELETE, and others) and provides options to customize headers, handle authentication, and manage cookies.
- On the server side, HTTP.jl lets us build a web server that listens for HTTP requests and sends responses back to clients. We can define endpoints, set up routing, and handle different HTTP methods. Handlers can also be wrapped in middleware for tasks like logging and session management.

### Primary Methods and Functions:
- **HTTP.request:** General function for making HTTP requests.
- **HTTP.get, HTTP.post, HTTP.put, HTTP.delete:** Specific functions for making HTTP GET, POST, PUT, and DELETE requests, respectively.
- **HTTP.serve:** Function to start a server and listen for requests. Its handler takes a request and returns a response.
- **HTTP.Response:** A type that represents an HTTP response, including status, headers, and body.
- **HTTP.listen:** A lower-level function for handling incoming connections. Its handler works directly with the connection stream.
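The server side described above can be sketched with a small handler. This is a minimal illustration, not a complete server: the handler name `hello_handler`, the `/hello` route, and port 8081 are all hypothetical choices made for this example.

```julia
using HTTP

# Hypothetical handler: takes an HTTP.Request and returns an HTTP.Response.
function hello_handler(req::HTTP.Request)
    if req.method == "GET" && req.target == "/hello"
        return HTTP.Response(200, "Hello from HTTP.jl")
    end
    return HTTP.Response(404, "Not found")
end

# A handler is an ordinary function, so it can be called without starting a server.
resp = hello_handler(HTTP.Request("GET", "/hello"))
println(resp.status, " ", String(resp.body))

# To actually serve requests (left commented out so the cell does not open a port):
# server = HTTP.serve!(hello_handler, "127.0.0.1", 8081)  # non-blocking variant of HTTP.serve
# close(server)
```

Calling the handler directly is a convenient way to check its logic before binding to a port.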

### Example 1: Making a GET Request

This example sends a GET request to httpbin.org, which is a service for testing HTTP requests and responses. It then prints out the status code and the body of the response.

In [91]:
#Pkg.add("HTTP")
using HTTP

# Send a GET request to an API endpoint
response = HTTP.get("https://httpbin.org/get")

# Display the status code and body of the response
println("Status Code: ", response.status)
println("Response Body: ", String(response.body))

Status Code: 200
Response Body: {
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip", 
    "Content-Length": "0", 
    "Host": "httpbin.org", 
    "User-Agent": "HTTP.jl/1.9.4", 
    "X-Amzn-Trace-Id": "Root=1-6568fab4-29552fe13993c27e473c205a"
  }, 
  "origin": "76.150.189.181", 
  "url": "https://httpbin.org/get"
}



### Example 2: Sending a POST Request with Data

In this example, a POST request is sent with a JSON body containing some data. The Content-Type header is set to application/json.

In [94]:
#Pkg.add("JSON")
using HTTP
using JSON

# Data to be sent in the request body
data = Dict("name" => "Julia", "language" => "JuliaLang")

# Convert data to JSON
json_data = JSON.json(data)

# Send a POST request with JSON data
response = HTTP.post("https://httpbin.org/post", body=json_data, headers=Dict("Content-Type" => "application/json"))

# Display the status code and body of the response
println("Status Code: ", response.status)
println("Response Body: ", String(response.body))

Status Code: 200
Response Body: {
  "args": {}, 
  "data": "{\"name\":\"Julia\",\"language\":\"JuliaLang\"}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip", 
    "Content-Length": "39", 
    "Content-Type": "application/json", 
    "Host": "httpbin.org", 
    "User-Agent": "HTTP.jl/1.9.4", 
    "X-Amzn-Trace-Id": "Root=1-6568fae6-4388cca95545ecd24e5e10f3"
  }, 
  "json": {
    "language": "JuliaLang", 
    "name": "Julia"
  }, 
  "origin": "76.150.189.181", 
  "url": "https://httpbin.org/post"
}

