In [1]:
import pandas as pd
import numpy as np
import datetime
import random
import json

The output of your code should be the original DataFrame, but with missing values in the 'Revenue' column filled in as described, a new 'Rating Category' column, and the mean revenue for each 'Rating Category'. The mean revenue should be output as a pandas Series.

In [2]:
#movies_df = pd.read_clipboard(sep=',')
#movies_df.to_csv("movies_sample.csv", index=False)
movies_df = pd.read_csv("movies_sample.csv")

In [3]:
median_revenue = movies_df['Revenue'].median()

In [4]:
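# Assign the result back; calling fillna(inplace=True) on a selected column may not update the DataFrame
movies_df['Revenue'] = movies_df['Revenue'].fillna(median_revenue)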

In [5]:
def rate_films(x):
    if x >= 8.5:
        return "Excellent"
    elif x >= 7:
        return "Very Good"
    elif x >= 5.5:
        return "Good"
    else:
        return "Average"


movies_df["Rating Category"] = movies_df["Rating"].apply(rate_films)
movies_df

Unnamed: 0,Title,Genre,Year,Runtime,Rating,Revenue,Rating Category
0,Jurassic World,Action,2015,124,7.0,652.27,Very Good
1,Manchester by the Sea,Drama,2016,137,7.9,47.7,Very Good
2,The Circle,Thriller,2017,110,5.3,20.48,Average
3,The Avengers,Action,2012,143,8.1,623.28,Very Good
4,Toy Story 3,Animation,2010,103,8.3,414.98,Very Good
5,John Wick,Action,2014,101,7.2,43.0,Very Good
6,The Shape of Water,Fantasy,2017,123,7.4,63.86,Very Good
7,Inside Out,Animation,2015,95,8.2,356.45,Very Good
8,Frozen,Animation,2013,102,7.5,400.74,Very Good


In [7]:
movies_df.groupby("Rating Category")[['Revenue']].mean()

Unnamed: 0_level_0,Revenue
Rating Category,Unnamed: 1_level_1
Average,20.48
Very Good,325.285


In [8]:
### Generated solution
import pandas as pd
movies_df['Decade'] = movies_df['Year'].apply(lambda x: f"{x//10*10}_{x//10*10+9}")
movies_df['Revenue'] = movies_df.groupby(['Decade', 'Genre'])['Revenue'].transform(lambda x: x.fillna(x.median()))
movies_df['Rating Category'] = pd.cut(movies_df['Rating'], bins=[0,5.5,7,8.5,10], labels=['Average','Good','Very Good','Excellent'])
# Select 'Revenue' before aggregating so the text columns are not averaged
mean_revenue = movies_df.groupby('Rating Category')['Revenue'].mean()


For the revenue null-value fills, the generated answer makes a great point:

- The expression `movies_df.groupby(['Decade', 'Genre'])['Revenue'].transform(lambda x: x.fillna(x.median()))` is an example of a more sophisticated method of data imputation, which takes into account the context of the missing data.

- In this case, you're filling in missing 'Revenue' values with the median revenue of movies that are in the same genre and were released in the same decade. This makes sense because movies from the same genre and decade are more likely to have similar revenues than movies from a different genre or time period, so the filled-in value is likely to be closer to the true one.

- On the other hand, `median_revenue = movies_df['Revenue'].median()` calculates the median revenue across all movies, regardless of their genre or release date. If you were to fill in missing values with this median, you'd be ignoring the context in which the data is missing. For example, you'd be treating a drama movie from the 1980s the same as an action movie from the 2020s, even though these types of movies might have very different revenues.

- Therefore, the first method is generally better because it's more likely to provide an accurate estimate of the missing data. One caveat: a (decade, genre) group with no known revenue at all has no median, so its missing values stay missing. The best method to use always depends on the specific dataset and problem at hand.
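The grouped fill described above can be sketched as follows. This is a minimal illustration on made-up data (the `sample` frame and the `Revenue_filled` column are assumptions, not part of the exercise), and it adds a fallback to the overall median for groups that have no known revenue.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, made up for illustration only
sample = pd.DataFrame({
    "Genre":   ["Action", "Action", "Action", "Drama", "Drama"],
    "Year":    [2012, 2014, 2015, 2016, 1985],
    "Revenue": [600.0, np.nan, 50.0, 40.0, np.nan],
})

# Decade label such as "2010_2019"
decade_start = sample["Year"] // 10 * 10
sample["Decade"] = decade_start.astype(str) + "_" + (decade_start + 9).astype(str)

# Step 1: fill with the median of the same (Decade, Genre) group
group_median = sample.groupby(["Decade", "Genre"])["Revenue"].transform("median")
sample["Revenue_filled"] = sample["Revenue"].fillna(group_median)

# Step 2: groups with no known revenue are still NaN, so fall back to the overall median
overall_median = sample["Revenue"].median()
sample["Revenue_filled"] = sample["Revenue_filled"].fillna(overall_median)

print(sample)
```

The 2014 action movie receives the median of the other 2010s action movies, while the 1985 drama has no peers in its group and receives the overall median instead.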

For the rating category, I also learned a bunch.
- `movies_df['Rating Category'] = pd.cut(movies_df['Rating'], bins=[0,5.5,7,8.5,10], labels=['Average','Good','Very Good','Excellent'])`
- This line of code uses the pd.cut() function to create a new column in the DataFrame called 'Rating Category'. The pd.cut() function is a way to create categories (or "bins") based on numeric values. In this case, it's being used to categorize the 'Rating' column into different groups based on the rating score.

- The bins argument specifies the boundaries for each category. Because pd.cut() closes each interval on the right by default, the list [0,5.5,7,8.5,10] means that the categories are:
  - (0, 5.5]: Average
  - (5.5, 7]: Good
  - (7, 8.5]: Very Good
  - (8.5, 10]: Excellent

- Note that this differs from rate_films at the boundaries: a rating of exactly 7.0 is 'Good' here, but 'Very Good' under the >= comparisons in rate_films.

- This line of code is essentially mapping each movie's rating to a category. It's an example of "binning" or "bucketing", which are common techniques used in data analysis and machine learning to deal with continuous variables. In this case, it's being used to simplify the 'Rating' column and make it easier to analyze.
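To see how the interval edges affect the result, here is a small sketch comparing the default pd.cut() behaviour with `right=False`, which lines up with the `>=` comparisons in rate_films. The `ratings` values are chosen by me to sit on the boundaries, and using `np.inf` as the top edge is my own choice so that a rating of exactly 10 still falls into a bin.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([5.3, 5.5, 7.0, 8.3, 8.5])
labels = ["Average", "Good", "Very Good", "Excellent"]

# Default (right=True): intervals are (0, 5.5], (5.5, 7], (7, 8.5], (8.5, 10]
right_closed = pd.cut(ratings, bins=[0, 5.5, 7, 8.5, 10], labels=labels)

# right=False: intervals are [0, 5.5), [5.5, 7), [7, 8.5), [8.5, inf)
left_closed = pd.cut(ratings, bins=[0, 5.5, 7, 8.5, np.inf], labels=labels, right=False)

comparison = pd.DataFrame({
    "Rating": ratings,
    "right=True": right_closed,
    "right=False": left_closed,
})
print(comparison)
```

The two versions agree everywhere except on ratings that equal a bin edge (5.5, 7.0 and 8.5 here).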

### You are given a list of dictionaries where each dictionary represents a movie. Each dictionary has the following key-value pairs:

- 'Title': The title of the movie (string).
- 'Genre': The genre of the movie (string).
- 'Year': The year the movie was released (integer).
- 'Runtime': The length of the movie in minutes (integer).
- 'Rating': The average user rating out of 10 (float).
Your task is to write a Python function that takes in this list and a genre, and returns the average rating of movies in that genre.

In addition, write an SQL query that would perform the same operation on a table with the same columns.

### Libraries Needed

- Python: None
- SQL: None (but you need to know SQL syntax)
### Inputs

The Python function `average_rating_by_genre(movies: List[dict], genre: str) -> float` takes in two arguments:

- movies: a list of dictionaries where each dictionary represents a movie with the key-value pairs described above.
- genre: a string representing the genre of the movies for which you want to calculate the average rating.

The SQL query should be written assuming you have a table named movies with the same columns as the dictionaries in the Python function.

### Expected Outputs

The Python function should return a float representing the average rating of movies in the input genre.

The SQL query should return a single row with a single column (which you can call 'Average Rating') that represents the average rating of movies in the specified genre.



In [9]:
movies = [
    {'Title': 'Jurassic World', 'Genre': 'Action', 'Year': 2015, 'Runtime': 124, 'Rating': 7.0},
    {'Title': 'Inside Out', 'Genre': 'Animation', 'Year': 2015, 'Runtime': 95, 'Rating': 8.2},
    {'Title': 'Toy Story 3', 'Genre': 'Animation', 'Year': 2010, 'Runtime': 103, 'Rating': 8.3},
    {'Title': 'John Wick', 'Genre': 'Action', 'Year': 2014, 'Runtime': 101, 'Rating': 7.2},
    {'Title': 'The Circle', 'Genre': 'Thriller', 'Year': 2017, 'Runtime': 110, 'Rating': 5.3},
    {'Title': 'Manchester by the Sea', 'Genre': 'Drama', 'Year': 2016, 'Runtime': 137, 'Rating': 7.9},
    {'Title': 'The Avengers', 'Genre': 'Action', 'Year': 2012, 'Runtime': 143, 'Rating': 8.1},
    {'Title': 'Frozen', 'Genre': 'Animation', 'Year': 2013, 'Runtime': 102, 'Rating': 7.5},
    {'Title': 'The Shape of Water', 'Genre': 'Fantasy', 'Year': 2017, 'Runtime': 123, 'Rating': 7.4},
]

In [26]:
def average_by_genre(dictionary, genre):
    # Collect ratings in a local list so results do not accumulate across calls
    genre_ratings = []
    for movie in dictionary:
        if movie['Genre'] == genre:
            genre_ratings.append(movie['Rating'])
    return round(sum(genre_ratings) / len(genre_ratings), 2)

In [27]:
average_by_genre(movies, 'Fantasy')

7.4

In [28]:
for genre in ['Action', 'Animation', 'Drama', 'Fantasy', 'Thriller']:
    print(f"The average rating of {genre} movies is {average_by_genre(movies, genre)}")

The average rating of Action movies is 7.43
The average rating of Animation movies is 8.0
The average rating of Drama movies is 7.9
The average rating of Fantasy movies is 7.4
The average rating of Thriller movies is 5.3


If I did this in SQL on a table called Movies, I would use the following query:

    SELECT Genre, AVG(Rating) AS Avg_Rating, COUNT(*) AS Num_Ratings
    FROM Movies
    GROUP BY Genre
    ORDER BY Avg_Rating DESC;
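The task statement asks for a single row for one specified genre. A minimal sketch of that variant, run from Python with the standard-library sqlite3 module, might look like the following. The in-memory database and the four inserted rows are assumptions for illustration; the column alias follows the task's suggested 'Average Rating'.

```python
import sqlite3

# Hypothetical in-memory table holding a few rows from the movies list
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Movies (Title TEXT, Genre TEXT, Year INTEGER, Runtime INTEGER, Rating REAL)"
)
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?, ?, ?)",
    [
        ("Inside Out", "Animation", 2015, 95, 8.2),
        ("Toy Story 3", "Animation", 2010, 103, 8.3),
        ("Frozen", "Animation", 2013, 102, 7.5),
        ("John Wick", "Action", 2014, 101, 7.2),
    ],
)

# The "?" placeholder keeps the genre value out of the SQL string itself
query = 'SELECT AVG(Rating) AS "Average Rating" FROM Movies WHERE Genre = ?'
(average_rating,) = conn.execute(query, ("Animation",)).fetchone()
print(round(average_rating, 2))

conn.close()
```

Filtering with WHERE returns one row for the requested genre, whereas the GROUP BY query returns one row per genre.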

### You are tasked with implementing a MovieDatabase class. The class should be initialized with a list of movies, where each movie is a dictionary with the following key-value pairs:

- 'Title': The title of the movie (string).
- 'Genre': The genre of the movie (string).
- 'Year': The year the movie was released (integer).
- 'Runtime': The length of the movie in minutes (integer).
- 'Rating': The average user rating out of 10 (float).
The MovieDatabase class should have a method average_rating_by_genre(self, genre: str) -> float: which returns the average rating of movies in the specified genre.

### Libraries Needed

- Python: None
### Inputs

The MovieDatabase class should be initialized with a movies: List[dict] argument, where movies is a list of dictionaries where each dictionary represents a movie with the key-value pairs described above.

The average_rating_by_genre method should take in a single argument:

- genre: a string representing the genre of the movies for which you want to calculate the average rating.
### Expected Outputs

The average_rating_by_genre method should return a float representing the average rating of movies in the input genre.



In [30]:
class Movie:
    def __init__(self, dictionary):
        self.title = dictionary['Title']
        self.genre = dictionary['Genre']
        self.year = dictionary['Year']
        self.runtime = dictionary['Runtime']
        self.rating = dictionary['Rating']

<__main__.Movie at 0x7fbe97d16160>

In [35]:
movie2 = Movie(movies[2])
movie2.genre

'Animation'

In [38]:
movies_as_objects = []

for i in movies:
    movies_as_objects.append(Movie(i))

In [42]:
movies_as_objects[0].rating

7.0

Generated solution:

In [None]:
from typing import List

class MovieDatabase:
    def __init__(self, movies: List[dict]):
        self.movies = movies

    def average_rating_by_genre(self, genre: str) -> float:
        genre_movies = [movie for movie in self.movies if movie['Genre'] == genre]
        average_rating = sum(movie['Rating'] for movie in genre_movies) / len(genre_movies)
        return average_rating

This took a different approach than what I ended up making, as it created a more comprehensive class that could be used to do more than just find the average rating by genre. I think this is a good approach. I also like how the class is initialized with a list of dictionaries, which is the same format as the original data.
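One thing the generated class does not handle is a genre with no movies, which would divide by zero. Below is a hedged sketch of one way to guard against that. Returning `None` for an unknown genre and rounding to two decimals are my own choices, and `sample_movies` is a shortened copy of the list above.

```python
from typing import List, Optional


class MovieDatabase:
    def __init__(self, movies: List[dict]):
        self.movies = movies

    def average_rating_by_genre(self, genre: str) -> Optional[float]:
        ratings = [movie["Rating"] for movie in self.movies if movie["Genre"] == genre]
        if not ratings:
            # Assumption: signal "no movies in this genre" with None instead of raising
            return None
        return round(sum(ratings) / len(ratings), 2)


sample_movies = [
    {"Title": "Jurassic World", "Genre": "Action", "Year": 2015, "Runtime": 124, "Rating": 7.0},
    {"Title": "John Wick", "Genre": "Action", "Year": 2014, "Runtime": 101, "Rating": 7.2},
    {"Title": "The Circle", "Genre": "Thriller", "Year": 2017, "Runtime": 110, "Rating": 5.3},
]

db = MovieDatabase(sample_movies)
print(db.average_rating_by_genre("Action"))
print(db.average_rating_by_genre("Western"))
```

Raising a ValueError would be an equally reasonable design; the point is only that the empty case needs an explicit decision.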

### Question:

You are given a small sample of a larger dataset, represented as a string that can be read into a pandas DataFrame using the `pd.read_clipboard()` function. The dataset represents sales data for a retail store and includes four columns: 'Product', 'Date', 'Sales', and 'Profit'.




Your task is to:

1. Read the data into a DataFrame.
2. Convert the 'Date' column to datetime type.
3. Replace any non-numeric characters in the 'Profit' column, then convert it to a numeric type.
4. Calculate the total sales and profit for each product, and store the result in a new DataFrame.

### Libraries Needed:

- pandas
- numpy

### Inputs:

- A string representing the dataset.

### Expected Outputs:

- A DataFrame representing the cleaned dataset.
- A DataFrame representing the total sales and profit for each product.


In [12]:
# Importing libraries
import pandas as pd
import numpy as np

# Reading the data
data = """
Product	Date	Sales	Profit
Printer	2022-05-17	12	$100
Laptop	2022-05-18	10	$150
Printer	2022-05-19	8	$75
Monitor	2022-05-20	15	$120
Laptop	2022-05-21	7	$100
"""
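# Parse the string above directly instead of relying on the system clipboard
from io import StringIO
df = pd.read_csv(StringIO(data), sep=r"\s+")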

In [13]:
df

Unnamed: 0,Product,Date,Sales,Profit
0,Printer,2022-05-17,12,$100
1,Laptop,2022-05-18,10,$150
2,Printer,2022-05-19,8,$75
3,Monitor,2022-05-20,15,$120
4,Laptop,2022-05-21,7,$100


In [14]:
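df['Date'] = pd.to_datetime(df['Date'])

# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
df['Profit'] = df['Profit'].str.replace('$', '', regex=False).astype('int')
df['Sales'] = df['Sales'].astype('int')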


In [16]:
print(f"Total Sales: ${df.Sales.sum()}")
print(f"Total Profit: ${df.Profit.sum()}")

Total Sales: $52
Total Profit: $545
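Step 4 of the task asks for totals per product in a new DataFrame rather than overall totals. A minimal sketch of that step, using the five rows above with 'Profit' already converted to integers (the names `sales` and `product_totals` are mine):

```python
import pandas as pd

# The five sample rows, with Profit already cleaned to integers
sales = pd.DataFrame({
    "Product": ["Printer", "Laptop", "Printer", "Monitor", "Laptop"],
    "Sales":   [12, 10, 8, 15, 7],
    "Profit":  [100, 150, 75, 120, 100],
})

# One row per product, with summed Sales and Profit
product_totals = sales.groupby("Product", as_index=False)[["Sales", "Profit"]].sum()
print(product_totals)
```

With `as_index=False` the product names stay in a regular column, so the result is an ordinary DataFrame that can be saved or merged without resetting the index.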


### Question:

Given a string of characters, write a function that calculates the frequency of each character in the string. Additionally, the function should return the character with the maximum frequency and the character with the minimum frequency.

The function signature should be `def char_frequency(string: str) -> dict, str, str:`. The function should return a


In [17]:
def frequency_characters(s):
    split_string = s.split()
    character_frequency = {}
    for character in split_string:
        if character in character_frequency:
            character_frequency[character] += 1
        else:
            character_frequency[character] = 1
    return character_frequency



In [18]:
frequency_characters("I am learning data science")

{'I': 1, 'am': 1, 'learning': 1, 'data': 1, 'science': 1}

In [19]:
def frequency_characters(s):
    split_string = s.split()
    character_frequency = {}
    for word in split_string:
        for character in word:
            if character in character_frequency:
                character_frequency[character] += 1
            else:
                character_frequency[character] = 1
    return character_frequency


In [20]:
frequency_characters("The quick brown fox jumps over the lazy dog")

{'T': 1,
 'h': 2,
 'e': 3,
 'q': 1,
 'u': 2,
 'i': 1,
 'c': 1,
 'k': 1,
 'b': 1,
 'r': 2,
 'o': 4,
 'w': 1,
 'n': 1,
 'f': 1,
 'x': 1,
 'j': 1,
 'm': 1,
 'p': 1,
 's': 1,
 'v': 1,
 't': 1,
 'l': 1,
 'a': 1,
 'z': 1,
 'y': 1,
 'd': 1,
 'g': 1}

In [31]:
character_freqs = frequency_characters("Honolulu")

# get the most frequent character, the number of times it occurs, and any other characters that occur the same number of times
max_freq = max(character_freqs.values())
most_frequent_characters = [k for k, v in character_freqs.items() if v == max_freq]
print(most_frequent_characters, max_freq)

['o', 'l', 'u'] 2
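Putting the pieces together, a single function could return the frequencies along with the most and least frequent characters. This is a sketch under my own assumptions: whitespace is ignored (as in the cells above), and ties are returned as lists rather than a single character, since the question text does not say how ties should be resolved.

```python
from collections import Counter
from typing import Dict, List, Tuple


def char_frequency(string: str) -> Tuple[Dict[str, int], List[str], List[str]]:
    # Count every non-whitespace character
    counts = Counter(ch for ch in string if not ch.isspace())
    if not counts:
        return {}, [], []

    highest = max(counts.values())
    lowest = min(counts.values())

    # Keep all characters that tie for the highest or lowest count
    most = [ch for ch, n in counts.items() if n == highest]
    least = [ch for ch, n in counts.items() if n == lowest]
    return dict(counts), most, least


freqs, most_common, least_common = char_frequency("Honolulu")
print(most_common, least_common)
```

Counter replaces the manual if/else bookkeeping, and the counts are case-sensitive, so 'H' and 'h' are treated as different characters.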


### Question:

You are given a list of dictionaries representing student grades for different subjects. Each dictionary includes the student's name and their grades for math, science, and english. Here's an example:
```python
grades = [
    {"name": "Alice", "math": 85, "science": 92, "english": 88},
    {"name": "Bob", "math": 90, "science": 87, "english": 95},
    {"name": "Charlie", "math": 82, "science": 89, "english": 91},
    {"name": "David", "math": 78, "science": 76, "english": 84},
    {"name": "Eve", "math": 92, "science": 90, "english": 93},
    {"name": "Frank", "math": 89, "science": 92, "english": 87},
    {"name": "Grace", "math": 91, "science": 88, "english": 93},
    {"name": "Henry", "math": 86, "science": 85, "english": 90},
    {"name": "Ivy", "math": 93, "science": 91, "english": 92},
    {"name": "Jack", "math": 88, "science": 86, "english": 89},
    {"name": "Kate", "math": 90, "science": 93, "english": 87},
    {"name": "Liam", "math": 92, "science": 90, "english": 88},
    {"name": "Mia", "math": 84, "science": 87, "english": 91},
    {"name": "Noah", "math": 91, "science": 82, "english": 89},
    {"name": "Olivia", "math": 89, "science": 88, "english": 90},
    {"name": "Patrick", "math": 87, "science": 91, "english": 88},
    {"name": "Quinn", "math": 86, "science": 90, "english": 87},
    {"name": "Ryan", "math": 92, "science": 89, "english": 92},
    {"name": "Sara", "math": 88, "science": 87, "english": 90},
    {"name": "Thomas", "math": 90, "science": 88, "english": 89}
]

```
#### Your task is to write a function that:

1. Calculates the average grade for each student across all subjects, and adds this to the student's dictionary under the key "average".
2. Returns a list of all students who have an average grade of 90 or higher. The list should contain only the names of the students, not their entire dictionaries.

In [30]:
grades = [
    {"name": "Alice", "math": 85, "science": 92, "english": 88},
    {"name": "Bob", "math": 90, "science": 87, "english": 95},
    {"name": "Charlie", "math": 82, "science": 89, "english": 91},
    {"name": "David", "math": 78, "science": 76, "english": 84},
    {"name": "Eve", "math": 92, "science": 90, "english": 93},
    {"name": "Frank", "math": 89, "science": 92, "english": 87},
    {"name": "Grace", "math": 91, "science": 88, "english": 93},
    {"name": "Henry", "math": 86, "science": 85, "english": 90},
    {"name": "Ivy", "math": 93, "science": 91, "english": 92},
    {"name": "Jack", "math": 88, "science": 86, "english": 89},
    {"name": "Kate", "math": 90, "science": 93, "english": 87},
    {"name": "Liam", "math": 92, "science": 90, "english": 88},
    {"name": "Mia", "math": 84, "science": 87, "english": 91},
    {"name": "Noah", "math": 91, "science": 82, "english": 89},
    {"name": "Olivia", "math": 89, "science": 88, "english": 90},
    {"name": "Patrick", "math": 87, "science": 91, "english": 88},
    {"name": "Quinn", "math": 86, "science": 90, "english": 87},
    {"name": "Ryan", "math": 92, "science": 89, "english": 92},
    {"name": "Sara", "math": 88, "science": 87, "english": 90},
    {"name": "Thomas", "math": 90, "science": 88, "english": 89}
]


In [31]:
grades

[{'name': 'Alice', 'math': 85, 'science': 92, 'english': 88},
 {'name': 'Bob', 'math': 90, 'science': 87, 'english': 95},
 {'name': 'Charlie', 'math': 82, 'science': 89, 'english': 91},
 {'name': 'David', 'math': 78, 'science': 76, 'english': 84},
 {'name': 'Eve', 'math': 92, 'science': 90, 'english': 93},
 {'name': 'Frank', 'math': 89, 'science': 92, 'english': 87},
 {'name': 'Grace', 'math': 91, 'science': 88, 'english': 93},
 {'name': 'Henry', 'math': 86, 'science': 85, 'english': 90},
 {'name': 'Ivy', 'math': 93, 'science': 91, 'english': 92},
 {'name': 'Jack', 'math': 88, 'science': 86, 'english': 89},
 {'name': 'Kate', 'math': 90, 'science': 93, 'english': 87},
 {'name': 'Liam', 'math': 92, 'science': 90, 'english': 88},
 {'name': 'Mia', 'math': 84, 'science': 87, 'english': 91},
 {'name': 'Noah', 'math': 91, 'science': 82, 'english': 89},
 {'name': 'Olivia', 'math': 89, 'science': 88, 'english': 90},
 {'name': 'Patrick', 'math': 87, 'science': 91, 'english': 88},
 {'name': 'Quin

In [32]:
def top_students(dataset):
    studs = []
    for student in dataset:
        student['average'] = round((student['math'] + student['science'] + student['english']) / 3, 2)
        # The task asks for an average of 90 or higher, so the comparison is inclusive
        if student['average'] >= 90:
            studs.append(student['name'])
    return studs

In [33]:
grades

[{'name': 'Alice', 'math': 85, 'science': 92, 'english': 88},
 {'name': 'Bob', 'math': 90, 'science': 87, 'english': 95},
 {'name': 'Charlie', 'math': 82, 'science': 89, 'english': 91},
 {'name': 'David', 'math': 78, 'science': 76, 'english': 84},
 {'name': 'Eve', 'math': 92, 'science': 90, 'english': 93},
 {'name': 'Frank', 'math': 89, 'science': 92, 'english': 87},
 {'name': 'Grace', 'math': 91, 'science': 88, 'english': 93},
 {'name': 'Henry', 'math': 86, 'science': 85, 'english': 90},
 {'name': 'Ivy', 'math': 93, 'science': 91, 'english': 92},
 {'name': 'Jack', 'math': 88, 'science': 86, 'english': 89},
 {'name': 'Kate', 'math': 90, 'science': 93, 'english': 87},
 {'name': 'Liam', 'math': 92, 'science': 90, 'english': 88},
 {'name': 'Mia', 'math': 84, 'science': 87, 'english': 91},
 {'name': 'Noah', 'math': 91, 'science': 82, 'english': 89},
 {'name': 'Olivia', 'math': 89, 'science': 88, 'english': 90},
 {'name': 'Patrick', 'math': 87, 'science': 91, 'english': 88},
 {'name': 'Quin

In [34]:
# Use a separate name for the result so the function is not overwritten
honor_roll = top_students(grades)
honor_roll

['Bob', 'Eve', 'Grace', 'Ivy', 'Kate', 'Liam', 'Ryan']

#### If I want to do this with Classes

In [44]:
class Student:
    def __init__(self, name, math_grade, science_grade, english_grade):
        self.name = name
        self.math_grade = math_grade
        self.science_grade = science_grade
        self.english_grade = english_grade

# Create instances of the Student class for the given students
students = [
    Student("Alice", 85, 92, 88),
    Student("Bob", 90, 87, 95),
    Student("Charlie", 82, 89, 96),
    Student("David", 78, 76, 84),
    Student("Eve", 92, 90, 93)
]

# Add the remaining 15 students
students.append(Student("Frank", 89, 92, 87))
students.append(Student("Grace", 91, 88, 93))
students.append(Student("Henry", 86, 85, 90))
students.append(Student("Ivy", 93, 91, 92))
students.append(Student("Jack", 88, 86, 89))
students.append(Student("Kate", 90, 93, 87))
students.append(Student("Liam", 92, 90, 88))
students.append(Student("Mia", 84, 87, 91))
students.append(Student("Noah", 91, 82, 89))
students.append(Student("Olivia", 89, 88, 90))
students.append(Student("Patrick", 87, 91, 88))
students.append(Student("Quinn", 86, 90, 87))
students.append(Student("Ryan", 92, 89, 92))
students.append(Student("Sara", 88, 87, 90))
students.append(Student("Thomas", 90, 88, 89))

# Print the list of students
for student in students:
    print(f"Name: {student.name}, Math: {student.math_grade}, Science: {student.science_grade}, English: {student.english_grade}")


Name: Alice, Math: 85, Science: 92, English: 88
Name: Bob, Math: 90, Science: 87, English: 95
Name: Charlie, Math: 82, Science: 89, English: 96
Name: David, Math: 78, Science: 76, English: 84
Name: Eve, Math: 92, Science: 90, English: 93
Name: Frank, Math: 89, Science: 92, English: 87
Name: Grace, Math: 91, Science: 88, English: 93
Name: Henry, Math: 86, Science: 85, English: 90
Name: Ivy, Math: 93, Science: 91, English: 92
Name: Jack, Math: 88, Science: 86, English: 89
Name: Kate, Math: 90, Science: 93, English: 87
Name: Liam, Math: 92, Science: 90, English: 88
Name: Mia, Math: 84, Science: 87, English: 91
Name: Noah, Math: 91, Science: 82, English: 89
Name: Olivia, Math: 89, Science: 88, English: 90
Name: Patrick, Math: 87, Science: 91, English: 88
Name: Quinn, Math: 86, Science: 90, English: 87
Name: Ryan, Math: 92, Science: 89, English: 92
Name: Sara, Math: 88, Science: 87, English: 90
Name: Thomas, Math: 90, Science: 88, English: 89


In [45]:
def find_studs(students):
    studs = []
    for student in students:
        # Average the three subject grades; keep the names of students whose average is 90 or higher
        if sum([student.math_grade, student.science_grade, student.english_grade]) / 3 >= 90:
            studs.append(student.name)
    return studs

In [50]:
find_studs(students)

['Bob', 'Eve', 'Grace', 'Ivy', 'Kate', 'Liam', 'Ryan']
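As a variation on the class approach, the average can live on the class itself so that every caller computes it the same way. This is only a sketch: the dataclass, the `average` property, the `honor_roll` helper and the three-student `roster` are my own illustrative names, and the threshold is inclusive to follow the task's "90 or higher".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Student:
    name: str
    math_grade: int
    science_grade: int
    english_grade: int

    @property
    def average(self) -> float:
        # Mean of the three subject grades, rounded to two decimals
        total = self.math_grade + self.science_grade + self.english_grade
        return round(total / 3, 2)


def honor_roll(students: List[Student], threshold: float = 90) -> List[str]:
    # Names of students whose average is at or above the threshold
    return [s.name for s in students if s.average >= threshold]


roster = [
    Student("Bob", 90, 87, 95),
    Student("Kate", 90, 93, 87),
    Student("Frank", 89, 92, 87),
]
print(honor_roll(roster))
```

The dataclass decorator generates `__init__` from the field declarations, which removes the repeated `self.x = x` lines.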

You have a dataset that represents the scores of a series of data science quizzes. However, you suspect that some of these scores are incorrect due to system errors. You want to clean this data by removing the outliers using the Z-score method and then calculate some basic statistics about the cleaned data.

Write a Python function that takes a Pandas DataFrame and a column name. The function should:

1. Calculate the Z-score for each score in the specified column of the DataFrame.
2. Consider scores to be outliers if their Z-scores are greater than 3 or less than -3.
3. Remove the outliers from the DataFrame.
4. Calculate and print the mean, median, and standard deviation of the cleaned scores.


In [51]:
import pandas as pd
import numpy as np

In [70]:
quiz_ids = ['q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9', 'q10', 'q11', 'q12']

# Generate random scores between 0 and 100
scores = np.random.uniform(0, 100, 88)

# Create a DataFrame with 100 rows
df = pd.DataFrame({'quiz_id': quiz_ids + ['q' + str(i) for i in range(13, 101)],
                   'score': list(scores) + list(np.random.uniform(0, 100, 12))})

# Display the extended DataFrame
print(df)


   quiz_id      score
0       q1  41.743762
1       q2  73.636967
2       q3  44.450448
3       q4  91.189612
4       q5  74.946726
..     ...        ...
95     q96  77.682330
96     q97  87.977109
97     q98  40.593418
98     q99  98.832670
99    q100  54.781276

[100 rows x 2 columns]


In [71]:
quiz_scores['z_score'] = (quiz_scores['score'] - quiz_scores['score'].mean()) / quiz_scores['score'].std()

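z = (x - μ) / σ


<u>Where:</u>

- z is the z-score,
- x is the observed value,
- μ is the mean of the population, and
- σ is the standard deviation of the population.

Note that pandas' .std() returns the sample standard deviation (ddof=1) by default, so the z-scores computed above use the sample estimate rather than the population σ.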
 


In [72]:
quiz_scores

Unnamed: 0,quiz_id,score,z_score
0,q1,90.5,-0.053555
1,q2,85.0,-0.12569
2,q3,95.5,0.012022
3,q4,80.0,-0.191267
4,q5,84.0,-0.138805
5,q6,96.5,0.025138
6,q7,82.0,-0.165036
7,q8,97.5,0.038253
8,q9,79.0,-0.204382
9,q10,95.0,0.005465


In [73]:
quiz_scores['Outlier'] = False
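# Set the flag in a single .loc call; chained indexing (.loc[...]['Outlier']) writes to a temporary copy and flags nothing
quiz_scores.loc[quiz_scores['z_score'].abs() > 3, 'Outlier'] = True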
quiz_scores

Unnamed: 0,quiz_id,score,z_score,Outlier
0,q1,90.5,-0.053555,False
1,q2,85.0,-0.12569,False
2,q3,95.5,0.012022,False
3,q4,80.0,-0.191267,False
4,q5,84.0,-0.138805,False
5,q6,96.5,0.025138,False
6,q7,82.0,-0.165036,False
7,q8,97.5,0.038253,False
8,q9,79.0,-0.204382,False
9,q10,95.0,0.005465,False


In [75]:
quiz_scores.loc[quiz_scores['Outlier'] == True]

Unnamed: 0,quiz_id,score,z_score,Outlier


In [76]:
quiz_scores.drop(quiz_scores.loc[quiz_scores['Outlier'] == True].index, inplace=True)

No rows were flagged as outliers here, but this is how they would be removed.

In [80]:
quiz_scores_mean = quiz_scores['score'].mean().round(2)
quiz_scores_median = quiz_scores['score'].median().round(2)
quiz_scores_std = quiz_scores['score'].std().round(2)

print(f"quiz score mean = {quiz_scores_mean}")
print(f"quiz score median = {quiz_scores_median}")
print(f"quiz score standard deviation = {quiz_scores_std}")

quiz score mean = 94.58
quiz score median = 87.75
quiz score standard deviation = 76.25
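The task asks for a single function that takes a DataFrame and a column name. The steps above could be wrapped up roughly as follows. This is a sketch, not the notebook's own code: the function name `clean_scores`, the `threshold` parameter and the `demo` data (29 plausible scores plus one obviously wrong entry) are assumptions for illustration.

```python
import pandas as pd


def clean_scores(df: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.DataFrame:
    values = df[column]

    # Z-score of each value, using pandas' default sample standard deviation
    z_scores = (values - values.mean()) / values.std()

    # Keep only rows whose z-score lies within [-threshold, threshold]
    cleaned = df[z_scores.abs() <= threshold].copy()

    print(f"mean = {cleaned[column].mean():.2f}")
    print(f"median = {cleaned[column].median():.2f}")
    print(f"standard deviation = {cleaned[column].std():.2f}")
    return cleaned


# Hypothetical data: 29 scores between 80 and 89, plus one erroneous value
demo = pd.DataFrame({"score": [80.0 + (i % 10) for i in range(29)] + [900.0]})
cleaned_demo = clean_scores(demo, "score")
```

Filtering with a boolean mask returns a new DataFrame and leaves the input untouched, which makes it easy to compare the statistics before and after cleaning.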
