### Questions for Analysis

#### User Behavior Analysis
1. **What are the most common rating patterns?**
   - Distribution of ratings (e.g., % of 1-star, 2-star, etc.).
   - Are users generally lenient (high average ratings) or strict?

2. **Which customers are the most active?**
   - Identify top users based on the number of ratings given.

3. **Are there patterns in user activity over time?**
   - Time-series analysis of ratings: Are there seasonal trends in rating activity?
   - Do users rate more movies during certain months or years?

4. **What is the average rating given by each user?**
   - Identify users with extreme behavior (always high ratings or always low ratings).
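
The first questions above could be approached along these lines. This is only a minimal sketch on a toy frame: the values are made up, and only the column names (`CustId`, `Rating`, `MovieId`) come from the dataset.

```python
import pandas as pd

# Toy stand-in for user_rating_df (hypothetical values, real column names)
ratings = pd.DataFrame({
    "CustId":  [1, 1, 1, 2, 2, 3],
    "Rating":  [5, 5, 5, 1, 2, 4],
    "MovieId": [10, 11, 12, 10, 11, 10],
})

# Share of each star value as a percentage of all ratings
rating_share = ratings["Rating"].value_counts(normalize=True).sort_index() * 100

# Number of ratings per customer, most active first
ratings_per_user = ratings["CustId"].value_counts()

# Per-user mean and count, to spot users who always rate high or always low
user_stats = ratings.groupby("CustId")["Rating"].agg(["mean", "count"])

print(rating_share)
print(ratings_per_user.head())
print(user_stats)
```

On the full data it is probably worth filtering `user_stats` by a minimum `count` before calling a user lenient or strict, since a mean over one or two ratings says little.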

---

#### Movie Popularity and Trends
5. **Which movies are the most rated?**
   - Find the movies with the highest number of ratings.

6. **What is the average rating of movies?**
   - Identify the highest-rated and lowest-rated movies on average.

7. **Are there trends in ratings over time for specific movies?**
   - For popular movies, analyze how ratings change over time.

8. **Does the year of release influence ratings?**
   - Are older or newer movies rated more favorably?
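
A possible starting point for the movie-level questions above is a single per-movie summary table. The sketch below uses hypothetical ratings, and `MIN_RATINGS` is an assumed threshold, not something defined in the notebook.

```python
import pandas as pd

# Hypothetical ratings; the movie rows reuse titles shown in movie_df.head()
ratings = pd.DataFrame({
    "MovieId": [1, 1, 1, 2, 2, 3],
    "Rating":  [4, 5, 3, 2, 2, 5],
})
movies = pd.DataFrame({
    "MovieId": [1, 2, 3],
    "ReleaseYear": [2003.0, 2004.0, 1997.0],
    "MovieTitle": ["Dinosaur Planet", "Isle of Man TT 2004 Review", "Character"],
})

# One row per movie: number of ratings and mean rating, joined with the titles
movie_stats = (
    ratings.groupby("MovieId")["Rating"]
    .agg(num_ratings="count", avg_rating="mean")
    .reset_index()
    .merge(movies, on="MovieId", how="left")
)

# Most-rated movies first
most_rated = movie_stats.sort_values("num_ratings", ascending=False)

# Highest average rating, ignoring movies with very few ratings
MIN_RATINGS = 2  # assumed threshold, tune for the real data
top_rated = (
    movie_stats[movie_stats["num_ratings"] >= MIN_RATINGS]
    .sort_values("avg_rating", ascending=False)
)

# Mean of the per-movie averages for each release year
avg_by_year = movie_stats.groupby("ReleaseYear")["avg_rating"].mean()
```

Aggregating before the merge keeps the join small, which matters when the ratings table has about 100 million rows.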

---

#### Correlation and Distribution
9. **What is the correlation between the number of ratings a movie receives and its average rating?**
   - Do more popular movies tend to have higher or lower ratings?

10. **How do ratings vary across different years of movie release?**
    - Distribution of ratings for movies by decade or year.
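
As a rough illustration of the two questions above, the sketch below assumes a per-movie summary with the hypothetical columns `num_ratings` and `avg_rating` has already been built. It uses the Pearson coefficient, which is only one possible choice.

```python
import pandas as pd

# Hypothetical per-movie summary (column names are assumptions for this sketch)
movie_stats = pd.DataFrame({
    "MovieId": [1, 2, 3, 4],
    "ReleaseYear": [1994.0, 1997.0, 2003.0, 2004.0],
    "num_ratings": [10, 20, 30, 40],
    "avg_rating": [2.0, 3.0, 3.5, 4.5],
})

# Pearson correlation between popularity and average rating
corr = movie_stats["num_ratings"].corr(movie_stats["avg_rating"])

# Bucket release years into decades; the nullable Int64 dtype tolerates missing years
movie_stats["Decade"] = (movie_stats["ReleaseYear"] // 10 * 10).astype("Int64")
by_decade = movie_stats.groupby("Decade")["avg_rating"].mean()

print(round(corr, 3))
print(by_decade)
```

Rating counts are typically very skewed, so a rank-based correlation or a log-scaled scatter plot may describe the relationship better than a single Pearson value.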

---

#### Personalized Recommendations
11. **Which genres or types of movies do specific users prefer?**
    - Cluster users based on their rating history and infer preferences.

12. **Can we predict user ratings for a movie?**
    - Build a predictive model to estimate how a user might rate a movie they haven’t seen.
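
Before building a real model for question 12, a simple bias baseline gives something to compare against. The sketch below is not a recommender system, only a baseline under stated assumptions: the data is made up and `predict` is a hypothetical helper that adds a user offset and a movie offset to the global mean.

```python
import pandas as pd

# Hypothetical ratings with the dataset's column names
ratings = pd.DataFrame({
    "CustId":  [1, 1, 2, 2, 3],
    "MovieId": [10, 11, 10, 11, 10],
    "Rating":  [5, 4, 3, 2, 4],
})

global_mean = ratings["Rating"].mean()

# How far each user and each movie sits from the global mean
user_bias = ratings.groupby("CustId")["Rating"].mean() - global_mean
movie_bias = ratings.groupby("MovieId")["Rating"].mean() - global_mean

def predict(cust_id, movie_id):
    """Baseline estimate: global mean + user offset + movie offset, clipped to 1..5."""
    est = global_mean + user_bias.get(cust_id, 0.0) + movie_bias.get(movie_id, 0.0)
    return float(min(5.0, max(1.0, est)))

# User 3 has not rated movie 11 in this toy data
print(predict(3, 11))
```

Unknown users or movies fall back to an offset of zero, so the estimate degrades to the global mean instead of failing.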

---

#### Temporal Insights
13. **How do rating trends evolve over time?**
    - Are there periods when users tend to give higher or lower ratings?
    - Identify any significant shifts in rating behavior (e.g., during holidays).

14. **Are there specific years or months when certain types of movies receive better ratings?**
    - E.g., holiday-themed movies around December.
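
The temporal questions could be explored by grouping on parts of the date. The sketch below is a minimal example with made-up rows; it parses `Date` with `pd.to_datetime` so that the `.dt` accessor is available, which is an assumption and differs from the pyarrow date type used later in this notebook.

```python
import pandas as pd

# Hypothetical ratings spread over two Decembers and one January
ratings = pd.DataFrame({
    "Date": ["2004-12-24", "2004-12-25", "2005-01-03", "2005-12-26"],
    "Rating": [5, 4, 2, 5],
})
ratings["Date"] = pd.to_datetime(ratings["Date"])

# Activity and mean rating per calendar month of each year (time series view)
by_month = (
    ratings.groupby(ratings["Date"].dt.to_period("M"))["Rating"]
    .agg(["count", "mean"])
)

# Mean rating per month number across all years (seasonality view)
by_calendar_month = ratings.groupby(ratings["Date"].dt.month)["Rating"].mean()

print(by_month)
print(by_calendar_month)
```

The first grouping answers "how did activity change over time", the second answers "is December different from other months".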

---

#### Anomalies and Outliers
15. **Are there users or movies with unusual patterns?**
    - Users who rate all movies the same or movies that receive a disproportionate number of extreme ratings.

16. **What are the least-rated movies?**
    - Identify movies with the lowest number of ratings and explore why.
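
One way to look for the anomalies described above is sketched here on toy data. `MIN_COUNT` is an assumed cutoff so that users with a single rating are not flagged as "always rating the same".

```python
import pandas as pd

# Hypothetical ratings: user 1 gives 5 stars to everything
ratings = pd.DataFrame({
    "CustId":  [1, 1, 1, 2, 2, 3],
    "MovieId": [10, 11, 12, 10, 11, 10],
    "Rating":  [5, 5, 5, 1, 4, 3],
})

# Distinct rating values and total ratings per user
per_user = ratings.groupby("CustId")["Rating"].agg(["nunique", "count"])

MIN_COUNT = 3  # assumed cutoff, tune for the real data
constant_raters = per_user[(per_user["nunique"] == 1) & (per_user["count"] >= MIN_COUNT)]

# Movie with the fewest ratings
least_rated = ratings["MovieId"].value_counts().nsmallest(1)

print(constant_raters)
print(least_rated)
```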

---

#### Cross-Dataset Insights
17. **Which movies have the highest rating variance?**
    - Movies with polarized opinions (some love, some hate).

18. **Are there trends in ratings by movie release year?**
    - Compare older classic movies with newer ones.

19. **What percentage of users rate movies from every decade?**
    - Explore user engagement with movies across time periods.

20. **What is the relationship between a movie’s release year and the number of ratings it receives?**
    - Do newer movies get rated more frequently?
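
Questions 17 and 19 need the ratings joined with the movie table. A minimal sketch follows, with made-up values and the dataset's column names; note that `var()` in pandas returns the sample variance by default.

```python
import pandas as pd

# Hypothetical ratings and movies
ratings = pd.DataFrame({
    "CustId":  [1, 1, 2, 2, 3],
    "MovieId": [10, 20, 10, 20, 10],
    "Rating":  [1, 3, 5, 3, 3],
})
movies = pd.DataFrame({"MovieId": [10, 20], "ReleaseYear": [1994.0, 2004.0]})

# Polarized movies have a high rating variance (sample variance, ddof=1)
rating_var = ratings.groupby("MovieId")["Rating"].var()

# Attach the release decade to each rating; rows without a release year are dropped
merged = ratings.merge(movies, on="MovieId", how="left").dropna(subset=["ReleaseYear"])
merged["Decade"] = (merged["ReleaseYear"] // 10 * 10).astype(int)

# Percentage of users who rated at least one movie from every decade present
n_decades = merged["Decade"].nunique()
decades_per_user = merged.groupby("CustId")["Decade"].nunique()
pct_all_decades = (decades_per_user == n_decades).mean() * 100

print(rating_var)
print(round(pct_all_decades, 2))
```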

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [36]:
movie_df = pd.read_csv("../../resources/netflix_movies.csv")
user_rating_df = pd.read_csv("../../resources/Netflix_User_Ratings.csv")

In [37]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17770 entries, 0 to 17769
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MovieId      17770 non-null  int64  
 1   ReleaseYear  17763 non-null  float64
 2   MovieTitle   17770 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 416.6+ KB


<p>We call info() on both DataFrames to get a quick overview of the columns and their data types.</p>
<p>Making sure each column has an appropriate type makes the later analysis easier.</p>

In [38]:
user_rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100480507 entries, 0 to 100480506
Data columns (total 4 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   CustId   int64 
 1   Rating   int64 
 2   Date     object
 3   MovieId  int64 
dtypes: int64(3), object(1)
memory usage: 3.0+ GB


Because the `Date` column has the `object` dtype, we need to convert it to a date type.

In [39]:
user_rating_df['Date'] = user_rating_df['Date'].astype('date32[pyarrow]')
user_rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100480507 entries, 0 to 100480506
Data columns (total 4 columns):
 #   Column   Dtype               
---  ------   -----               
 0   CustId   int64               
 1   Rating   int64               
 2   Date     date32[day][pyarrow]
 3   MovieId  int64               
dtypes: date32[day][pyarrow](1), int64(3)
memory usage: 2.6 GB
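
The same idea can be taken further if pyarrow is not available or memory is tight. The sketch below is an alternative under assumptions, not what the notebook does: it parses `Date` with `pd.to_datetime` and narrows the integer columns. The chosen widths assume ratings of 1 to 5, movie ids up to 17770 (as in `movie_df`), and customer ids that fit in 32 bits.

```python
import pandas as pd

# Two rows copied from the ratings table shown in this notebook
sample = pd.DataFrame({
    "CustId": [1488844, 822109],
    "Rating": [3, 5],
    "Date": ["2005-09-06", "2005-05-13"],
    "MovieId": [1, 1],
})

# Parse ISO-formatted strings into pandas datetimes
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Narrow the integer columns (assumed value ranges, check them on the real data)
sample["Rating"] = sample["Rating"].astype("int8")
sample["MovieId"] = sample["MovieId"].astype("int16")
sample["CustId"] = sample["CustId"].astype("int32")

print(sample.dtypes)
```

It is worth checking each column's `min()` and `max()` on the real data first, because `astype` to a narrower integer type can overflow if a value is out of range.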


In [48]:
# Add a 1-based 'index' column
user_rating_df['index'] = np.arange(1,user_rating_df.shape[0]+1)

In [49]:
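# set_index returns a new DataFrame; without inplace=True or reassignment,
# user_rating_df itself keeps 'index' as a regular column
user_rating_df.set_index('index')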


| index | CustId | Rating | Date | MovieId |
| --- | --- | --- | --- | --- |
| 1 | 1488844 | 3 | 2005-09-06 | 1 |
| 2 | 822109 | 5 | 2005-05-13 | 1 |
| 3 | 885013 | 4 | 2005-10-19 | 1 |
| 4 | 30878 | 4 | 2005-12-26 | 1 |
| 5 | 823519 | 3 | 2004-05-03 | 1 |
| ... | ... | ... | ... | ... |
| 100480503 | 1790158 | 4 | 2005-11-01 | 17770 |
| 100480504 | 1608708 | 3 | 2005-07-19 | 17770 |
| 100480505 | 234275 | 1 | 2004-08-07 | 17770 |
| 100480506 | 255278 | 4 | 2004-05-28 | 17770 |


In [50]:
user_rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100480507 entries, 0 to 100480506
Data columns (total 5 columns):
 #   Column   Dtype               
---  ------   -----               
 0   CustId   int64               
 1   Rating   int64               
 2   Date     date32[day][pyarrow]
 3   MovieId  int64               
 4   index    int64               
dtypes: date32[day][pyarrow](1), int64(4)
memory usage: 3.4 GB
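
The `info()` output above still lists `index` as a regular column and reports a `RangeIndex` starting at 0, because `set_index` returns a new DataFrame unless it is called with `inplace=True` or the result is assigned back. A small sketch of this behavior follows, using three rows from the table above; `df2` is a hypothetical name, and assigning a `RangeIndex` directly is one possible alternative that avoids the helper column.

```python
import pandas as pd

df = pd.DataFrame({"CustId": [1488844, 822109, 885013], "Rating": [3, 5, 4]})
df["index"] = range(1, len(df) + 1)

# set_index returns a new frame; df itself is left untouched
returned = df.set_index("index")
still_column = "index" in df.columns

# Alternative: assign a 1-based RangeIndex directly, with no helper column
df2 = df.drop(columns="index")
df2.index = pd.RangeIndex(start=1, stop=len(df2) + 1, name="index")

print(still_column)
print(df2)
```

A `RangeIndex` stores only its start, stop, and step, so it should also avoid the extra memory that a materialized `int64` column costs on about 100 million rows.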


Add the same 1-based index to the movie DataFrame.

In [None]:
movie_df['index'] = np.arange(1,movie_df.shape[0]+1)
movie_df.set_index('index',inplace=True)

In [52]:
movie_df.shape[0]

17770

In [57]:
movie_df.head()

| index | MovieId | ReleaseYear | MovieTitle |
| --- | --- | --- | --- |
| 1 | 1 | 2003.0 | Dinosaur Planet |
| 2 | 2 | 2004.0 | Isle of Man TT 2004 Review |
| 3 | 3 | 1997.0 | Character |
| 4 | 4 | 1994.0 | Paula Abdul's Get Up & Dance |
| 5 | 5 | 2004.0 | The Rise and Fall of ECW |


In [7]:
user_rating_df.isna().sum()

CustId     0
Rating     0
Date       0
MovieId    0
dtype: int64

In [8]:
movie_df.isna().sum()

MovieId        0
ReleaseYear    7
MovieTitle     0
dtype: int64
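
`ReleaseYear` has 7 missing values, which is also why pandas reads it as `float64`. One way to handle this, sketched below on toy rows, is to inspect the affected movies and then switch to the nullable `Int64` dtype so that years display as integers while the gaps stay marked as missing. The row with the missing year is invented for the example.

```python
import numpy as np
import pandas as pd

# Toy movie table; the second row is hypothetical and stands in for a missing year
movies = pd.DataFrame({
    "MovieId": [1, 2, 3],
    "ReleaseYear": [2003.0, np.nan, 1997.0],
    "MovieTitle": ["Dinosaur Planet", "Hypothetical Untitled Movie", "Character"],
})

# Inspect the rows without a release year before deciding what to do with them
missing_year = movies[movies["ReleaseYear"].isna()]

# Nullable integer dtype: whole-number years, missing values kept as <NA>
movies["ReleaseYear"] = movies["ReleaseYear"].astype("Int64")

print(missing_year)
print(movies.dtypes)
```

Whether to drop, keep, or manually fill those rows depends on the analysis; for release-year questions they would simply be excluded.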

In [14]:
movie_df.head()

|  | MovieId | ReleaseYear | MovieTitle |
| --- | --- | --- | --- |
| 0 | 1 | 2003.0 | Dinosaur Planet |
| 1 | 2 | 2004.0 | Isle of Man TT 2004 Review |
| 2 | 3 | 1997.0 | Character |
| 3 | 4 | 1994.0 | Paula Abdul's Get Up & Dance |
| 4 | 5 | 2004.0 | The Rise and Fall of ECW |
