# Data Analysis

In [19]:
from filmsdk_ibrahim import MovieClient, MovieConfig
import time
import pandas as pd

In [39]:
# API Connection

config = MovieConfig(movie_base_url="http://localhost")
client = MovieClient(config=config)

client.health_check()

MOVIE_API_BASE_URL in MovieConfig init: https://backend-cinema-96tw.onrender.com


{'message': 'api is working well'}

In [22]:
# Test of getting a film
movie = client.get_movie(1)
print(f"Title : {movie.title}")
print(f"Genre : {movie.genres}")

Title : Toy Story (1995)
Genre : Adventure|Animation|Children|Comedy|Fantasy


In [23]:
# Get ratings (limit = 100)
ratings_df = client.list_ratings(output_format="pandas")
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
95,1,1445,3.0,964984112
96,1,1473,4.0,964980875
97,1,1500,4.0,964980985
98,1,1517,5.0,964981107


In [24]:
# Get df infos
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   userId     100 non-null    int64  
 1   movieId    100 non-null    int64  
 2   rating     100 non-null    float64
 3   timestamp  100 non-null    int64  
dtypes: float64(1), int64(3)
memory usage: 3.3 KB


In [25]:
# Get statistics
analytics = client.get_analytics()
print(analytics)

movie_count=9742 rating_count=100836 tag_count=3683 link_count=9742


In [26]:
# Get total ratings
total_ratings = analytics.rating_count
total_ratings

100836

In [27]:
# Get all ratings
total_ratings = client.get_analytics().rating_count
batch_size = 1000
all_ratings = []

for skip in range(0, total_ratings, batch_size):
    print(f"Get lines from {skip} to {skip + batch_size}...")
    batch_df = client.list_ratings(skip=skip, limit=batch_size, output_format="pandas")
    all_ratings.append(batch_df)
    time.sleep(0.5)

complete_ratings_df = pd.concat(all_ratings, ignore_index=True)

complete_ratings_df

Get lines from 0 to 1000...
Get lines from 1000 to 2000...
Get lines from 2000 to 3000...
Get lines from 3000 to 4000...
Get lines from 4000 to 5000...
Get lines from 5000 to 6000...
Get lines from 6000 to 7000...
Get lines from 7000 to 8000...
Get lines from 8000 to 9000...
Get lines from 9000 to 10000...
Get lines from 10000 to 11000...
Get lines from 11000 to 12000...
Get lines from 12000 to 13000...
Get lines from 13000 to 14000...
Get lines from 14000 to 15000...
Get lines from 15000 to 16000...
Get lines from 16000 to 17000...
Get lines from 17000 to 18000...
Get lines from 18000 to 19000...
Get lines from 19000 to 20000...
Get lines from 20000 to 21000...
Get lines from 21000 to 22000...
Get lines from 22000 to 23000...
Get lines from 23000 to 24000...
Get lines from 24000 to 25000...
Get lines from 25000 to 26000...
Get lines from 26000 to 27000...
Get lines from 27000 to 28000...
Get lines from 28000 to 29000...
Get lines from 29000 to 30000...
Get lines from 30000 to 31000...

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [29]:
complete_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [30]:
# Get ratings by user
ratings_per_user = complete_ratings_df['userId'].value_counts().rename_axis('userId').reset_index(name='rating_count')

ratings_per_user

Unnamed: 0,userId,rating_count
0,414,2698
1,599,2478
2,474,2108
3,448,1864
4,274,1346
...,...,...
605,320,20
606,207,20
607,257,20
608,569,20


## Relevant Business Question

### Which film genres do users tag most positively (rating ≥ 4.0), and what are the most frequent tags associated with these genres?

Why is this relevant?
It helps understand user preferences not only through ratings but also through the qualitative tags they add.

A marketing analyst or a recommendation algorithm can use this information to:

- recommend similar movies,

- optimize movie classification,

- better understand the “moods” or intentions behind high ratings.

### Required data (via SDK + API):
- ratings: to filter for high ratings (rating ≥ 4.0).

- tags: to see which tags are used on the same (userId, movieId) pairs.

- movies: to enrich with the corresponding genres.

### Main steps:
- List all ratings ≥ 4.0 (in batches if necessary).

- For each filtered (userId, movieId) pair, try to retrieve a tag via client.get_tag(...) or by listing all tags and matching.

- Retrieve the movie genres via client.get_movie(movieId).

- Aggregate: count how often each (genre, tag) combination occurs.
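
The steps above can be sketched on a few made-up rows before running them against the API. The three small frames below are invented for illustration (they are not API output); only the column names follow the ratings, tags, and movies data used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the API data (values are invented for illustration).
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 20],
    "rating": [4.5, 2.0, 4.0, 5.0],
})
tags = pd.DataFrame({
    "userId": [1, 1, 3],
    "movieId": [10, 20, 20],
    "tag": ["twist ending", "boring", "twist ending"],
})
movies = pd.DataFrame({
    "movieId": [10, 20],
    "genres": ["Mystery|Thriller", "Thriller"],
})

# 1. Keep high ratings only.
high = ratings.loc[ratings["rating"] >= 4.0, ["userId", "movieId"]]

# 2. Keep tags written by the same user on the same movie (inner join).
tagged = high.merge(tags, on=["userId", "movieId"])

# 3. Attach genres and emit one row per (tag, genre).
tagged = tagged.merge(movies, on="movieId")
tagged["genres"] = tagged["genres"].str.split("|")
exploded = tagged.explode("genres")

# 4. Count genre/tag combinations.
summary = (
    exploded.groupby(["genres", "tag"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(summary)
```

The tag "boring" is dropped because user 1 rated that movie below 4.0, and user 2's high rating contributes nothing because that user added no tag.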

In [31]:
# Step 1: Retrieve high ratings (rating >= 4.0) in batches

chunk_size = 1000
skip = 0
all_high_ratings = []

while True:
    chunk = client.list_ratings(
        skip=skip,
        limit=chunk_size,
        min_rating=4.0,
        output_format="pandas"
    )
    
    if chunk.empty:
        break
    
    all_high_ratings.append(chunk)
    skip += chunk_size
    time.sleep(0.5)  # Pause between calls to avoid errors

# Merge all chunks
high_ratings_df = pd.concat(all_high_ratings, ignore_index=True)
print(high_ratings_df.shape)
high_ratings_df


(48580, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
48575,610,166528,4.0,1493879365
48576,610,166534,4.0,1493848402
48577,610,168248,5.0,1493850091
48578,610,168250,5.0,1494273047


In [32]:
# Step 2: Identify unique (userId, movieId) pairs

user_movie_pairs = high_ratings_df[['userId', 'movieId']].drop_duplicates()
user_movie_pairs

Unnamed: 0,userId,movieId
0,1,1
1,1,3
2,1,6
3,1,47
4,1,50
...,...,...
48575,610,166528
48576,610,166534
48577,610,168248
48578,610,168250


In [33]:
# Step 3: Retrieve the corresponding tags

all_tags = []
skip = 0
chunk_size = 1000
while True:
    tag_chunk = client.list_tags(skip=skip, limit=chunk_size, output_format="pandas")
    if tag_chunk.empty:
        break
    all_tags.append(tag_chunk)
    skip += chunk_size
    time.sleep(0.5)

all_tags_df = pd.concat(all_tags, ignore_index=True)

all_tags_df

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978
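
The skip/limit loop is written out twice in this notebook, once for ratings and once for tags. A minimal sketch of how it could be factored into one helper follows. `fetch_all` and `fake_list` are hypothetical names, not part of the SDK; the helper only assumes a callable that accepts `skip` and `limit` and returns a sized page. It also assumes that a page shorter than the requested size means the data is exhausted, which saves the final empty request.

```python
import time
from typing import Callable, List, Sequence


def fetch_all(
    fetch_page: Callable[[int, int], Sequence],
    page_size: int = 1000,
    pause: float = 0.0,
) -> List[Sequence]:
    """Collect pages from a skip/limit endpoint until it is exhausted."""
    pages = []
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        if len(page) == 0:
            break
        pages.append(page)
        # Assumption: a page shorter than page_size means no rows are left.
        if len(page) < page_size:
            break
        skip += page_size
        time.sleep(pause)  # pause between calls to go easy on the API
    return pages


# Stand-in for an endpoint holding 2,500 rows (hypothetical, for illustration).
DATA = list(range(2500))
calls = []


def fake_list(skip: int, limit: int) -> list:
    calls.append(skip)
    return DATA[skip:skip + limit]


pages = fetch_all(fake_list, page_size=1000)
rows = [row for page in pages for row in page]
print(len(pages), len(rows))
```

With the SDK this might be called as `fetch_all(lambda skip, limit: client.list_tags(skip=skip, limit=limit, output_format="pandas"), pause=0.5)` and followed by `pd.concat(pages, ignore_index=True)`, assuming `list_tags` keeps accepting the keyword arguments used in the cell above.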


In [34]:
# Merge with high ratings
tagged_high_ratings = pd.merge(user_movie_pairs, all_tags_df, on=["userId", "movieId"])
print(tagged_high_ratings.shape)
tagged_high_ratings

(2378, 4)


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
2373,606,6107,World War II,1178473747
2374,606,7382,for katie,1171234019
2375,610,3265,gun fu,1493843984
2376,610,3265,heroic bloodshed,1493843978
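
`pd.merge` defaults to an inner join, so the 2,378 rows above are the tags whose author also rated the same movie 4.0 or higher. Highly rated movies without a tag from that user drop out, as do tags on movies the user rated lower. One way to see how much an inner join discards is a left join with `indicator=True`. The sketch below uses invented rows, not the notebook's data.

```python
import pandas as pd

# Invented rows for illustration.
pairs = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10]})
tags = pd.DataFrame({
    "userId": [1, 1, 3],
    "movieId": [10, 10, 30],
    "tag": ["funny", "quotable", "slow"],
})

# Left join keeps every high-rating pair; `_merge` says whether a tag matched.
coverage = pairs.merge(tags, on=["userId", "movieId"], how="left", indicator=True)

# A pair with several tags appears several times, so deduplicate before counting.
per_pair = coverage.drop_duplicates(["userId", "movieId"])
tagged_share = (per_pair["_merge"] == "both").mean()
print(f"{tagged_share:.0%} of high-rating pairs have at least one tag")
```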


In [35]:
# Step 4: Retrieve genres associated with movieId

def get_movie_genre(movie_id):
    try:
        movie = client.get_movie(movie_id)
        return movie.genres
    except Exception:  # lookup failed; fall back to an empty genre string
        return ""

# Apply only to the unique movieIds that we have in the tags
unique_movie_ids = tagged_high_ratings['movieId'].unique()

movie_genres = {
    movie_id: get_movie_genre(movie_id)
    for movie_id in unique_movie_ids
}

# Add the genres column
tagged_high_ratings['genres'] = tagged_high_ratings['movieId'].map(movie_genres)
print(tagged_high_ratings.shape)
tagged_high_ratings

(2378, 5)


Unnamed: 0,userId,movieId,tag,timestamp,genres
0,2,60756,funny,1445714994,Comedy
1,2,60756,Highly quotable,1445714996,Comedy
2,2,60756,will ferrell,1445714992,Comedy
3,2,89774,Boxing story,1445715207,Drama
4,2,89774,MMA,1445715200,Drama
...,...,...,...,...,...
2373,606,6107,World War II,1178473747,Drama|War
2374,606,7382,for katie,1171234019,Drama|Mystery|Thriller
2375,610,3265,gun fu,1493843984,Action|Crime|Drama|Thriller
2376,610,3265,heroic bloodshed,1493843978,Action|Crime|Drama|Thriller
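
Looking up genres this way costs one request per unique `movieId`, and a failed request is indistinguishable from a movie without genres when both yield an empty string. A possible variant, sketched below with a stand-in for the client, records failures separately so they can be retried or reported. `fetch_genres`, `FakeMovie`, and `fake_get_movie` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class FakeMovie:
    genres: str


def fake_get_movie(movie_id: int) -> FakeMovie:
    """Stand-in for client.get_movie; raises for unknown ids."""
    catalog = {1: "Comedy", 2: "Drama|War"}
    if movie_id not in catalog:
        raise LookupError(f"movie {movie_id} not found")
    return FakeMovie(genres=catalog[movie_id])


def fetch_genres(
    movie_ids: Iterable[int],
    get_movie: Callable[[int], FakeMovie],
) -> Tuple[Dict[int, str], List[int]]:
    """Return (genres by movie id, ids whose lookup failed)."""
    genres: Dict[int, str] = {}
    failed: List[int] = []
    for movie_id in movie_ids:
        try:
            genres[movie_id] = get_movie(movie_id).genres
        except Exception:  # keep going, but remember what failed
            failed.append(movie_id)
    return genres, failed


genres, failed = fetch_genres([1, 2, 99], fake_get_movie)
print(genres)
print(failed)
```

Movie ids missing from the dictionary become `NaN` after `.map(...)`, which makes them easy to find with `isna()`.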


In [36]:
# Step 5: Final aggregation: genre ↔ tag ↔ count

# Split pipe-separated genres into lists, then emit one row per genre
tagged_high_ratings['genres'] = tagged_high_ratings['genres'].str.split('|')
tagged_exploded = tagged_high_ratings.explode('genres')

tagged_exploded

Unnamed: 0,userId,movieId,tag,timestamp,genres
0,2,60756,funny,1445714994,Comedy
1,2,60756,Highly quotable,1445714996,Comedy
2,2,60756,will ferrell,1445714992,Comedy
3,2,89774,Boxing story,1445715207,Drama
4,2,89774,MMA,1445715200,Drama
...,...,...,...,...,...
2376,610,3265,heroic bloodshed,1493843978,Drama
2376,610,3265,heroic bloodshed,1493843978,Thriller
2377,610,168248,Heroic Bloodshed,1493844270,Action
2377,610,168248,Heroic Bloodshed,1493844270,Crime
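
The last rows above show `heroic bloodshed` and `Heroic Bloodshed` as separate values, so a plain `groupby` counts them as two different tags. If case and surrounding whitespace are not meaningful for the analysis (an assumption), tags can be normalized first. A small sketch on invented rows:

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "genres": ["Action", "Action", "Crime"],
    "tag": ["heroic bloodshed", "Heroic Bloodshed ", "gun fu"],
})

# Lower-case and trim so spelling variants collapse into one tag.
df["tag_norm"] = df["tag"].str.strip().str.lower()

raw_counts = df.groupby(["genres", "tag"]).size()
norm_counts = df.groupby(["genres", "tag_norm"]).size()
print(len(raw_counts), len(norm_counts))
```

Keeping the original `tag` column alongside `tag_norm` preserves the users' spelling for display.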


In [37]:
# Count the Genre / Tag combinations
genre_tag_summary = (
    tagged_exploded
    .groupby(['genres', 'tag'])      # Group by 'genres' and 'tag'
    .size()                         # Count the number of occurrences in each group
    .reset_index(name='count')      # Reset the index and rename the count column to 'count'
    .sort_values(by='count', ascending=False)  # Sort the results by 'count' in descending order
)

genre_tag_summary

Unnamed: 0,genres,tag,count
1971,Drama,In Netflix queue,20
2159,Drama,atmospheric,19
4321,Thriller,twist ending,16
3280,Mystery,twist ending,14
4304,Thriller,suspense,14
...,...,...,...
4467,Western,silly,1
4468,Western,tension building,1
4469,Western,violent,1
4470,Western,visually appealing,1
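
Because the summary is sorted by count across all genres, the most common tags of a smaller genre sit far down the table. Taking the first few rows of each genre group gives a per-genre view. A sketch on invented counts (only the column names match `genre_tag_summary`):

```python
import pandas as pd

# Invented counts for illustration.
summary = pd.DataFrame({
    "genres": ["Drama", "Drama", "Drama", "Western", "Western"],
    "tag": ["atmospheric", "slow", "moving", "violent", "silly"],
    "count": [19, 3, 7, 2, 1],
})

top_n = 2
top_tags = (
    summary.sort_values(["genres", "count"], ascending=[True, False])
    .groupby("genres")
    .head(top_n)  # first top_n rows of each genre, row order preserved
    .reset_index(drop=True)
)
print(top_tags)
```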


# Highlights
The genre_tag_summary table cross-tabulates movie genres with the tags applied by users who also gave the movie a high rating (rating ≥ 4.0). Some observations and interpretations follow.

# What the table shows
- Each row represents a unique combination of genre and tag.

- The count column indicates how many times a tag was applied to a movie of that genre by a user who rated the movie 4.0 or higher.

# Examples

- Drama + In Netflix queue: this combination occurs 20 times among tags on highly rated Drama movies.

- Thriller + twist ending: this combination occurs 16 times, a hint at what viewers value in thrillers.

# Business interpretations
### Discovering viewer preferences by genre:

- Highly rated dramas are often tagged “In Netflix queue”. This tag may reflect an intention to watch the film or an indirect recommendation rather than a quality of the film itself.

- Thrillers with twist endings or a suspenseful atmosphere are particularly appreciated, so they are good candidates to prioritize in recommendations.

### Usefulness for a recommendation engine:

By analyzing the tags most associated with highly rated movies in each genre, personalized recommendations can be better targeted.

For example, recommend “Mystery” movies with twists to those who like well-rated “Thrillers” with that tag.

### Marketing and categorization insights:

Platforms can create categories like:

- “Mystery with a Twist”

- “Atmospheric Dramas”

- “Suspenseful Thrillers”

These groupings can enhance engagement by better matching what people like with what’s offered to them.

### Targeted Campaigns Highlighting Popular Tags:

The comparatively high counts of tags such as “In Netflix queue” (20) and “atmospheric” (19) in Drama indicate user interest that marketing teams can use to promote dramas with these qualities.

### Cross-Genre Promotion Opportunities:

The popularity of “twist ending” in both Thriller and Mystery genres suggests cross-promotion potential, encouraging users who enjoy one genre to explore the other.
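
One caveat before acting on this: genres were exploded, so a single film labelled `Mystery|Thriller` adds its tags to both genres. Part of the overlap may therefore come from the same films rather than from two separate audiences. A rough check, sketched here on invented rows, is to count how many distinct movies carrying the tag belong to both genres:

```python
import pandas as pd

# Invented rows in the shape of the exploded frame (one row per tag and genre).
exploded = pd.DataFrame({
    "movieId": [10, 10, 20, 30],
    "tag": ["twist ending"] * 4,
    "genres": ["Mystery", "Thriller", "Thriller", "Mystery"],
})

twist = exploded[exploded["tag"] == "twist ending"]
mystery_ids = set(twist.loc[twist["genres"] == "Mystery", "movieId"])
thriller_ids = set(twist.loc[twist["genres"] == "Thriller", "movieId"])

shared = mystery_ids & thriller_ids    # films counted in both genres
only_one = mystery_ids ^ thriller_ids  # films counted in one genre only
print(f"shared: {len(shared)}, single-genre: {len(only_one)}")
```

If most tagged films turn out to be shared, the cross-promotion signal is weaker than the raw counts suggest.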

### Niche Tag Identification for Specialized Content:

Less frequent tags like “Borg” (Action) or “morality” (Western) reveal niche interests that could be the focus of micro-campaigns or specialized collections to engage specific audience segments.