<a href="https://colab.research.google.com/github/isys5002-itp/ISYS5002-2023-Semester1/blob/main/10_visualise_netflix_student_Thu12PM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Write a program that reads the netflix.csv file and counts the number of titles in each genre for movies produced in Australia.
Plot the number of movies in each genre.
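
One possible approach to this task, sketched on a small made-up DataFrame rather than the real file. It assumes the `type`, `country` and `genre` columns used later in this notebook, and that `country` holds a single country name per row. The rows below are invented for illustration only.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real dataset
sample_df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie"],
    "country": ["Australia", "Australia", "Australia", "India", "Australia"],
    "genre": ["Dramas", "Comedies", "Dramas", "Dramas", "Dramas"],
})

# Keep only movies produced in Australia
is_aus_movie = (sample_df["type"] == "Movie") & (sample_df["country"] == "Australia")
aus_movies = sample_df[is_aus_movie]

# Count the titles in each genre, largest first
genre_counts = aus_movies.groupby("genre").size().sort_values(ascending=False)
print(genre_counts)
```

With matplotlib installed, `genre_counts.plot(kind="bar")` would draw these counts as a bar chart.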

Reference source: https://wuraolaifeoluwa.medium.com/basic-data-wrangling-and-visualization-of-netflix-data-8b9609328f8c

In [None]:
import pandas as pd
netflix_df = pd.read_csv("netflix_data.csv")

In [None]:
netflix_df.info()

In [None]:
netflix_df.head()

In [None]:
# Count the missing values ("null" or "NaN") in each column
netflix_df.isnull().sum()

In [None]:
# date_added
# Forward fill: replace each missing 'date_added' with the value from the previous row
netflix_df['date_added'] = netflix_df['date_added'].ffill()

# country
# 'country' is categorical, so fill missing values with its most frequent value (the mode)
netflix_df['country'] = netflix_df['country'].fillna(netflix_df['country'].mode()[0])
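
To see what the two fill strategies do, here is a minimal sketch on a hypothetical four-row DataFrame (`toy_df` and its values are made up, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data with gaps in both columns
toy_df = pd.DataFrame({
    "date_added": ["January 1, 2020", None, "March 5, 2021", None],
    "country": ["India", "United States", None, "United States"],
})

# Forward fill: each missing date takes the previous row's value
toy_df["date_added"] = toy_df["date_added"].ffill()

# Mode fill: missing countries take the most frequent country
most_common = toy_df["country"].mode()[0]
toy_df["country"] = toy_df["country"].fillna(most_common)

print(toy_df)
```

Note that a forward fill cannot fill a missing value in the very first row, because there is no previous row to copy from.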

In [None]:
netflix_df.isnull().sum()

Python Pandas Tutorial: A Complete Introduction for Beginners: 
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/



In [None]:
# selecting one column
# This will return a Series. 
df_series = netflix_df['release_year']
print(df_series)

# To extract a column as a DataFrame, you need to pass a list of column names. 
df_release_year = netflix_df[['release_year']]
df_release_year
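
The difference between the two selections is easiest to see by checking the types and shapes. A small sketch, using a made-up `toy_df` in place of `netflix_df`:

```python
import pandas as pd

# Hypothetical toy data standing in for netflix_df
toy_df = pd.DataFrame({
    "release_year": [2019, 2020, 2021],
    "genre": ["Dramas", "Comedies", "Dramas"],
})

as_series = toy_df["release_year"]    # single brackets: one-dimensional Series
as_frame = toy_df[["release_year"]]   # list of names: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```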

In [None]:
# selecting two columns
two_column = ['release_year','genre']
netflix_df[two_column]
# OR
# netflix_df[['release_year', 'genre']]

In [None]:
subset_df = netflix_df[['show_id', 'country','release_year', 'genre', 'type', 'duration']]
subset_df

In [None]:
subset_df.info()


In [None]:
import pandas as pd

df = subset_df

# group the data by country and type, and calculate the mean duration for each group
grouped = df.groupby(['country', 'type']).agg({'duration': 'mean'})

# print the result
print(grouped)


In [None]:
# use unstack() method to "pivot"/transform/reorganise one of the levels of the index to become column headers.

grouped = grouped.unstack()
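
A minimal sketch of what `unstack()` does to a grouped result, using hypothetical data (the countries and durations are invented). The inner index level, `type`, moves from the rows to the columns.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "country": ["India", "India", "Japan", "Japan", "Japan"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "duration": [120, 2, 90, 110, 1],
})

# Long form: one row per (country, type) pair
long_form = toy_df.groupby(["country", "type"]).agg({"duration": "mean"})

# Wide form: one row per country, one column per type
wide_form = long_form.unstack()

print(long_form)
print(wide_form)
```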

In [None]:
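# plot a stacked bar chart of the number of TV shows and movies for the 10 countries with the most titles
counts = df.groupby(['country', 'type']).size().unstack(fill_value=0)
grouped_10 = counts.loc[counts.sum(axis=1).nlargest(10).index]
grouped_10.plot(kind='bar', stacked=True, figsize=(10, 6))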

In [None]:
# group the data by country and type, and calculate the mean release year and count of show_id for each group
grouped = df.groupby(['country', 'type']).agg({'release_year': 'mean', 'show_id': 'count'})

# print the result
print(grouped)

### Groupby
* Groups data based on one or more columns, usually categorical variables.
* Splits a dataset into groups based on the chosen column or columns, then applies a function (such as mean, sum or count) to each group.

```
grouped = dataframe.groupby('column_name')
```
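
The split and apply steps can be seen separately in a short sketch. The data below is hypothetical. Iterating over the grouped object shows the split, and calling `mean()` applies a function to each group and combines the results.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2018, 2020, 2020, 2016],
})

grouped = toy_df.groupby("type")

# Split: each group is a sub-DataFrame sharing one value of 'type'
for name, group in grouped:
    print(name, len(group))

# Apply and combine: one mean release year per group
mean_year = grouped["release_year"].mean()
print(mean_year)
```
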
E.g.: Analyse the distribution of TV shows and movies.

In [None]:
# df_type = subset_df.groupby('type')['type'].count()
# df_type

# OR

df_type = subset_df.groupby('type').size()
print(df_type)

Plot a pie chart

In [None]:
import matplotlib.pyplot as plt

# Plot the data as a pie chart
df_type.plot(kind='pie', autopct='%1.1f%%')

# Add title and labels
plt.title('Distribution of TV Shows to Movies')
plt.ylabel('')

# Show the plot
plt.show()

In [None]:
import pandas as pd

# work with the subset of columns selected earlier
df = subset_df

# group the data by country and release year, and count the number of titles in each group
grouped = df.groupby(['country', 'release_year'])['show_id'].count()

# print the result
print(grouped)


You can also pass a list of column names to group by multiple columns.

Once you have grouped the data, you can apply various functions to the grouped data using the agg() method. For example, you can calculate the mean of each group like this:


```
grouped = dataframe.groupby('column_name')
grouped_mean = grouped.agg('mean')

```
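
`agg()` can also compute several statistics at once. The sketch below uses named aggregation on made-up data. The output column names `mean_year` and `titles` are arbitrary choices for illustration. Naming the columns to aggregate also avoids asking pandas for the mean of a text column, which raises an error in recent pandas versions.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "release_year": [2018, 2020, 2020, 2016],
    "show_id": ["s1", "s2", "s3", "s4"],
})

# One output column per (input column, function) pair
summary = toy_df.groupby("type").agg(
    mean_year=("release_year", "mean"),
    titles=("show_id", "count"),
)
print(summary)
```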


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Work with the subset of columns selected earlier
df = subset_df
# Filter the data to only include release years from 2010 onwards
df = df[df['release_year'] >= 2010]

# Group the data by release year and type, and calculate the mean duration
grouped = df.groupby(['release_year', 'type']).agg({'duration': 'mean'})

# Reset the index to make the release year and type columns regular columns
grouped = grouped.reset_index()


Plot a line graph

In [None]:
# Plot the data as a line chart
plt.plot(grouped[grouped['type'] == 'TV Show']['release_year'], grouped[grouped['type'] == 'TV Show']['duration'], label='TV Show')
plt.plot(grouped[grouped['type'] == 'Movie']['release_year'], grouped[grouped['type'] == 'Movie']['duration'], label='Movie')

# Add labels and legend to the chart
plt.xlabel('Release Year')
plt.ylabel('Average Duration')
plt.title('Average Duration of TV Shows and Movies Released Since 2010')
plt.legend()

# Display the chart
plt.show()

### Filtering

E.g.: Determine the countries that produced the most movies in the year 2020.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Use the full DataFrame loaded earlier
df = netflix_df

# Filter the DataFrame to only include movies released in 2020
# (the variable below is named movies_2021, but the filter selects 2020)
movies_2021 = df[(df["type"] == "Movie") & (df["release_year"] == 2020)]

# Group the data by country and count the number of movies in each group
grouped = movies_2021.groupby("country")["show_id"].count().sort_values(ascending=False)
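
The filter above combines two conditions with `&`. Each comparison produces a boolean Series, and the parentheses are needed because `&` binds more tightly than `==`. A sketch on hypothetical rows, with the two masks named separately for readability:

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4", "s5"],
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie"],
    "country": ["India", "India", "India", "Japan", "Japan"],
    "release_year": [2020, 2020, 2020, 2020, 2019],
})

# Each comparison yields one True/False value per row
is_movie = toy_df["type"] == "Movie"
is_2020 = toy_df["release_year"] == 2020

# Keep rows where both conditions are True
movies_2020 = toy_df[is_movie & is_2020]

per_country = movies_2020.groupby("country")["show_id"].count().sort_values(ascending=False)
print(per_country)
```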


Plot a bar graph

In [None]:
ax = grouped.plot(kind="bar", figsize=(10, 5))
ax.set_xlabel("Country")
ax.set_ylabel("Number of Movies Released in 2020")
ax.set_title("Number of Movies by Country for 2020")
plt.show()

Get the top 10 countries with the most movies

In [None]:
# Let's get top 10
grouped = movies_2021.groupby("country")["show_id"].count().sort_values(ascending=False).head(10)
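
As an aside, `value_counts()` offers a shorter route to the same kind of ranking: it counts rows per value and sorts them in descending order. It matches the groupby-and-count approach as long as the counted column (`show_id` above) has no missing values. A sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical country column
countries = pd.Series(
    ["India", "Japan", "India", "Spain", "India", "Japan"],
    name="country",
)

# Count rows per country (sorted descending) and keep the top 2
top_2 = countries.value_counts().head(2)
print(top_2)
```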


Plot horizontal bar graph

In [None]:
# Create a horizontal bar graph of the counts by country

ax = grouped.sort_values().plot(kind="barh", figsize=(10, 6))
ax.set_xlabel("Number of Movies Released")
ax.set_ylabel("Country")
ax.set_title("Top 10 Countries with Most Movies Released in 2020")
plt.show()
