## 🎬 Real-World Project: Movie Data Analysis

### 📊 Dataset Overview

We're analyzing movie data fetched from **The Movie Database (TMDB) API**.

---

### 🎯 Data Structure

Each movie entry contains the following fields:

| Field | Description | Example |
|-------|-------------|---------|
| **id** | Unique movie identifier | `278` |
| **title** | Movie name | `"The Shawshank Redemption"` |
| **release_date** | Release date | `"1994-09-23"` |
| **popularity** | Popularity score | `89.456` |
| **vote_average** | Average rating (0-10) | `8.7` |
| **vote_count** | Number of votes | `24,567` |

---

### 📡 API Information

**Endpoint:** Top Rated Movies
```
https://api.themoviedb.org/3/movie/top_rated?api_key=YOUR_API_KEY&language=en-US&page=1
```

**API Key:** `ac361a8f6c64ad982114ec7da336c450`

---

### 📈 Dataset Scale
```
Movies per page:    20
Total pages:        428
Total movies:       20 × 428 = 8,560 movies
```

> **Note:** At 8,560 rows this is a small but realistic dataset, well suited to practicing NumPy operations.

---

### 🔗 Fetching Data

**Full API URL:**
```
https://api.themoviedb.org/3/movie/top_rated?api_key=ac361a8f6c64ad982114ec7da336c450&language=en-US&page=1
```

**Parameters:**
- `api_key` - Your authentication key
- `language` - Response language (en-US)
- `page` - Page number (1-428)
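
The query string can also be assembled from these parameters instead of being typed by hand. The sketch below is one way to do that with the standard library; `build_top_rated_url` and the `TMDB_API_KEY` environment variable are hypothetical names chosen for this example, not part of TMDB's API.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/movie/top_rated"


def build_top_rated_url(page, api_key, language="en-US"):
    """Build the Top Rated URL for one page (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # urlencode joins the parameters as key=value pairs separated by "&"
    query = urlencode({"api_key": api_key, "language": language, "page": page})
    return f"{BASE_URL}?{query}"


# Assumption: the key is stored in an environment variable named TMDB_API_KEY;
# the placeholder is used when that variable is not set.
api_key = os.environ.get("TMDB_API_KEY", "YOUR_API_KEY")
print(build_top_rated_url(1, api_key))
```

Reading the key from the environment keeps it out of the notebook itself, which matters if the notebook is shared.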

---

### 💡 What's Next?

With this movie dataset, we can perform:
- 📊 Statistical analysis on ratings
- 📈 Popularity trend analysis
- 🎯 Top movies identification
- 📉 Vote distribution analysis

---

In [4]:
import pandas as pd
import requests

### 🔄 Fetching and Processing Movie Data

Now let's fetch real data from the TMDB API and process it using Python!

---

#### Step 1: Making the API Request
```python
res = requests.get("https://api.themoviedb.org/3/movie/top_rated?api_key=ac361a8f6c64ad982114ec7da336c450&language=en-US&page=1")
```

**What happens here:**
- `requests.get()` - Sends an HTTP GET request to the API
- Returns a response object containing the movie data
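
A request can fail (an invalid key, a rate limit, a stalled connection), and `requests.get()` does not raise an exception for an HTTP error status on its own. A minimal sketch of a more defensive call, assuming a 10-second timeout is acceptable; `get_json` is a hypothetical helper name:

```python
import requests


def get_json(url, timeout=10):
    """Fetch a URL and return the decoded JSON body (hypothetical helper)."""
    # timeout keeps the call from waiting forever on a stalled connection
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of continuing
    response.raise_for_status()
    return response.json()
```

With this helper, the movie list for a page would be `get_json(url)['results']`.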

---

#### Step 2: Extracting JSON Data
```python
data = res.json()['results']
```

**Breaking it down:**
- `res.json()` - Converts the API response to a Python dictionary
- `['results']` - Extracts the list of movies from the response

**Response structure:**
```json
{
  "page": 1,
  "results": [
    {
      "id": 278,
      "title": "The Shawshank Redemption",
      "popularity": 89.456,
      ...
    },
    ...
  ],
  "total_pages": 428
}
```
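
Besides `results`, the response carries paging metadata such as `page` and `total_pages`. The sketch below parses a trimmed copy of the structure above with the standard `json` module, so it runs without a network call; the payload is abbreviated to the fields shown in this section.

```python
import json

# Trimmed sample payload: only the fields shown in the structure above
raw = """
{
  "page": 1,
  "results": [
    {"id": 278, "title": "The Shawshank Redemption", "popularity": 89.456}
  ],
  "total_pages": 428
}
"""

payload = json.loads(raw)             # same kind of dict that res.json() returns
movies = payload["results"]           # list of movie dictionaries
total_pages = payload["total_pages"]  # how many pages the endpoint reports

print(len(movies), "movie(s) on page", payload["page"], "of", total_pages)
print(movies[0]["title"])
```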

---

#### Step 3: Creating a DataFrame
```python
df = pd.DataFrame(data)
```

> Converts the list of movie dictionaries into a **pandas DataFrame** (table format)

**Initial DataFrame includes all fields:**
- id, title, popularity, release_date, vote_average, vote_count
- adult, backdrop_path, genre_ids, original_language, original_title
- overview, poster_path, video

---

#### Step 4: Selecting Specific Columns
```python
df = df[['id', 'title', 'popularity', 'release_date', 'vote_average', 'vote_count']]
```

**Why filter columns?**
- Focus on the fields the analysis actually uses
- Cleaner output when inspecting the DataFrame
- Lower memory use, since unused columns are dropped

**Selected fields:**
| Column | Type | Description |
|--------|------|-------------|
| `id` | int | Unique identifier |
| `title` | str | Movie name |
| `popularity` | float | Popularity score |
| `release_date` | str | Release date (YYYY-MM-DD) |
| `vote_average` | float | Rating (0-10) |
| `vote_count` | int | Number of votes |
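
The types in this table can be checked with `df.dtypes`. Because `release_date` arrives as text, converting it with `pd.to_datetime` is a common next step before filtering by year. A short sketch, using two rows copied from the notebook output as stand-in data (the extra `adult` field is included only to show that column selection drops it):

```python
import pandas as pd

# Two rows standing in for the API result
data = [
    {"id": 278, "title": "The Shawshank Redemption", "popularity": 34.9294,
     "release_date": "1994-09-23", "vote_average": 8.713, "vote_count": 29053,
     "adult": False},
    {"id": 238, "title": "The Godfather", "popularity": 32.1983,
     "release_date": "1972-03-14", "vote_average": 8.7, "vote_count": 21954,
     "adult": False},
]

columns = ["id", "title", "popularity", "release_date", "vote_average", "vote_count"]
df = pd.DataFrame(data)[columns]

# release_date is stored as text; convert it to work with years and date ranges
df["release_date"] = pd.to_datetime(df["release_date"])

print(df.dtypes)
print(df["release_date"].dt.year.tolist())
```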

---

#### 📊 Sample Output
```
    id                     title  popularity  release_date  vote_average  vote_count
0  278  The Shawshank Redemption      89.456    1994-09-23           8.7       24567
1  238             The Godfather      78.234    1972-03-14           8.7       17891
2  240     The Godfather Part II      65.123    1974-12-20           8.6       11234
3  424          Schindler's List      54.890    1993-12-15           8.6       14567
4  389              12 Angry Men      48.765    1957-04-10           8.5        7890
...
```

---

### 🎯 Key Points

- ✅ **requests.get()** fetches data from API
- ✅ **.json()** parses JSON response
- ✅ **pd.DataFrame()** creates structured table
- ✅ **Column selection** keeps only needed data

---

In [14]:
res = requests.get("https://api.themoviedb.org/3/movie/top_rated?api_key=ac361a8f6c64ad982114ec7da336c450&language=en-US&page=1")
data = res.json()['results']

df = pd.DataFrame(data)

df = df[['id', 'title' , 'popularity','release_date', 'vote_average', 'vote_count']]
df.head()

,id,title,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,34.9294,1994-09-23,8.713,29053
1,238,The Godfather,32.1983,1972-03-14,8.7,21954
2,240,The Godfather Part II,19.114,1974-12-20,8.572,13263
3,424,Schindler's List,17.1857,1993-12-15,8.567,16785
4,389,12 Angry Men,13.2914,1957-04-10,8.5,9468


### 🎬 Fetching Complete Movie Dataset

Now let's fetch the **entire dataset** from all 428 pages of TMDB's top-rated movies.

---

#### 📥 Full Dataset Collection
```python
import pandas as pd
import requests

# Columns to keep from every page
columns = ['id', 'title', 'popularity', 'release_date', 'vote_average', 'vote_count']

# Collect one DataFrame per page, then concatenate once at the end
# (cheaper than growing a DataFrame inside the loop)
frames = []

# Loop through ALL pages (1 to 428)
for i in range(1, 429):
    # Construct URL with dynamic page number
    url = f'https://api.themoviedb.org/3/movie/top_rated?api_key=ac361a8f6c64ad982114ec7da336c450&language=en-US&page={i}'

    # Make API request and stop on an HTTP error status
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Extract results from JSON
    data = response.json()['results']

    # Keep only the selected columns for this page
    frames.append(pd.DataFrame(data)[columns])

# Build the final DataFrame in a single concat
df = pd.concat(frames, ignore_index=True)

# Check final dataset size
print(df.shape)
# Output: (8560, 6)
```
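
The page count is hard-coded as 428 in the loop, but each response also reports `total_pages`, so the loop bound can be read from page 1 instead. A sketch under that assumption: `fetch_all` and its `fetch_page` argument are hypothetical names, and `fetch_page(page)` is expected to return the decoded JSON for one page (for example via `requests.get(...).json()`).

```python
import time

import pandas as pd

COLUMNS = ["id", "title", "popularity", "release_date", "vote_average", "vote_count"]


def fetch_all(fetch_page, delay=0.0):
    """Collect every page into one DataFrame (hypothetical helper)."""
    first = fetch_page(1)
    total_pages = first["total_pages"]  # read the page count from the response
    frames = [pd.DataFrame(first["results"])[COLUMNS]]

    for page in range(2, total_pages + 1):
        time.sleep(delay)               # optional pause between requests
        payload = fetch_page(page)
        frames.append(pd.DataFrame(payload["results"])[COLUMNS])

    # One concat at the end avoids copying the growing table on every page
    return pd.concat(frames, ignore_index=True)
```

Passing the fetch function in as an argument also makes the loop easy to try on a few fake pages before spending 428 real requests.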

---

### 📊 Final Dataset Information
```python
df.shape
# Output: (8560, 6)
```

**Dataset Summary:**
- **8,560 rows** - Total movies collected
- **6 columns** - Data fields per movie
- **428 pages** - API pages fetched
- **20 movies/page** - Page size (8,560 ÷ 428)

---

### 🎯 What We Collected

| Column | Data Type | Description | Example |
|--------|-----------|-------------|---------|
| **id** | int | Unique movie ID | 278 |
| **title** | string | Movie name | "The Shawshank Redemption" |
| **popularity** | float | Popularity score | 89.456 |
| **release_date** | string | Release date | "1994-09-23" |
| **vote_average** | float | Average rating (0-10) | 8.7 |
| **vote_count** | int | Number of votes | 24,567 |

---

### ⏱️ Performance Details

**Total API Calls:** 428 requests  
**Estimated Time:** 7-12 minutes  
**Data Transferred:** ~2-3 MB  
**Final Memory Usage:** ~500 KB (DataFrame)

---

### 🔍 Quick Data Preview
```python
# First few rows
print(df.head())

# Last few rows
print(df.tail())

# Basic statistics
print(df.describe())
```

**Sample Output:**
```
       id                     title  popularity  release_date  vote_average  vote_count
0     278  The Shawshank Redemption      89.456    1994-09-23           8.7       24567
1     238             The Godfather      78.234    1972-03-14           8.7       17891
2     240     The Godfather Part II      65.123    1974-12-20           8.6       11234
3     424          Schindler's List      54.890    1993-12-15           8.6       14567
4     389              12 Angry Men      48.765    1957-04-10           8.5        7890
...
8555  XXX               Movie Title      12.345    2023-05-15           7.2        1234
8556  XXX               Movie Title      11.234    2022-08-20           7.1        1567
8557  XXX               Movie Title      10.456    2021-11-10           7.0        1890
8558  XXX               Movie Title       9.876    2020-03-25           6.9        2345
8559  XXX               Movie Title       8.765    2019-07-14           6.8        2678

[8560 rows x 6 columns]
```

---

### ✅ Success Indicators

After running this code, you should have:

- ✅ A DataFrame with 8,560 movies
- ✅ All 6 columns properly populated
- ✅ No missing values in essential fields
- ✅ Sequential index from 0 to 8,559
- ✅ Movies sorted by TMDB's top-rated ranking
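
Most of these indicators can be checked in code rather than by eye. A minimal sketch, assuming that "essential fields" means `id`, `title`, `vote_average`, and `vote_count`; that choice and the `check_dataset` name are assumptions made for this example.

```python
import pandas as pd

# Assumption: these are the fields that must never be missing
ESSENTIAL = ["id", "title", "vote_average", "vote_count"]


def check_dataset(df, expected_rows):
    """Return a dict of pass/fail checks for the collected DataFrame."""
    return {
        "row_count": len(df) == expected_rows,
        "six_columns": df.shape[1] == 6,
        "no_missing_essential": not df[ESSENTIAL].isna().any().any(),
        "sequential_index": list(df.index) == list(range(len(df))),
    }
```

For the full dataset the call would be `check_dataset(df, expected_rows=8560)`; any `False` value points at the indicator that failed.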

---

### 💡 Next Steps

With this complete dataset, you can now:
- 📊 Perform statistical analysis
- 📈 Visualize trends over time
- 🎯 Identify top-rated movies
- 📉 Analyze vote distributions
- 🔍 Filter by year, rating, or popularity
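
A few of these steps, sketched on the five rows shown in the notebook output so the example runs without the API. The same expressions should apply unchanged to the full DataFrame; the `year` column is added here for convenience and is not part of the API response.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [278, 238, 240, 424, 389],
    "title": ["The Shawshank Redemption", "The Godfather",
              "The Godfather Part II", "Schindler's List", "12 Angry Men"],
    "popularity": [34.9294, 32.1983, 19.114, 17.1857, 13.2914],
    "release_date": ["1994-09-23", "1972-03-14", "1974-12-20",
                     "1993-12-15", "1957-04-10"],
    "vote_average": [8.713, 8.7, 8.572, 8.567, 8.5],
    "vote_count": [29053, 21954, 13263, 16785, 9468],
})

# Derive the release year from the date string
df["year"] = pd.to_datetime(df["release_date"]).dt.year

# Statistical summary of the ratings, via the underlying NumPy array
ratings = df["vote_average"].to_numpy()
print(ratings.mean(), ratings.min(), ratings.max())

# Filter: movies released in the 1990s with a rating of at least 8.5
nineties = df[df["year"].between(1990, 1999) & (df["vote_average"] >= 8.5)]
print(nineties["title"].tolist())

# Top movies by number of votes
top_voted = df.nlargest(3, "vote_count")
print(top_voted["title"].tolist())
```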

---

### 🎯 Dataset Ready!

Your movie dataset is now complete and ready for NumPy analysis! 🎬✨

---

In [19]:
df = pd.DataFrame()  # empty DataFrame

for i in range(1, 429):  # (use smaller range while testing)
    url = f'https://api.themoviedb.org/3/movie/top_rated?api_key=ac361a8f6c64ad982114ec7da336c450&language=en-US&page={i}'
    response = requests.get(url)
    data = response.json()['results']

    temp_df = pd.DataFrame(data)[['id', 'title', 'popularity', 'release_date', 'vote_average', 'vote_count']]

    df = pd.concat([df, temp_df], ignore_index=True)

df.shape


(8560, 6)

### 💾 Saving the Dataset

After fetching all the data, let's save it to a CSV file for future use.

---

#### 📁 Export to CSV
```python
df.to_csv('movies.csv')
```

> Saves the DataFrame to a CSV file named **`movies.csv`** in the current directory

---

### 🎯 What Happens

**File created:** `movies.csv`  
**Location:** Current working directory  
**Size:** roughly 0.5 MB (8,560 short rows)  
**Format:** Comma-separated values

---

### 📄 CSV File Structure
```csv
,id,title,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,89.456,1994-09-23,8.7,24567
1,238,The Godfather,78.234,1972-03-14,8.7,17891
2,240,The Godfather Part II,65.123,1974-12-20,8.6,11234
...
```

> **Note:** The first column (unnamed) is the DataFrame index

---

### 💡 Better Options

#### Remove Index Column
```python
df.to_csv('movies.csv', index=False)
```
> Cleaner CSV without the index column

---

#### Specify Encoding
```python
df.to_csv('movies.csv', index=False, encoding='utf-8')
```
> Ensures proper character encoding for international titles

---

### 📂 Loading the Data Later
```python
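# Load the saved CSV back into a DataFrame
# (assumes the file was written with index=False; otherwise an extra
# "Unnamed: 0" column appears and the shape becomes (8560, 7))
df = pd.read_csv('movies.csv')
print(df.shape)
# Output: (8560, 6)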
```

> No need to fetch from API again - instant loading! ⚡
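
This "fetch once, load afterwards" pattern can be wrapped in a small helper. A sketch, where `load_or_fetch` is a hypothetical name and `fetch` stands for any function that returns the DataFrame (such as the page loop above):

```python
from pathlib import Path

import pandas as pd


def load_or_fetch(path, fetch):
    """Load the CSV if it exists; otherwise call fetch() and save the result."""
    path = Path(path)
    if path.exists():
        return pd.read_csv(path)
    df = fetch()                   # any function returning a DataFrame
    df.to_csv(path, index=False)   # index=False keeps the column count stable
    return df
```

The first call pays for the API requests; later calls only read the file.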

---

### ✅ Benefits of Saving to CSV

- ✅ **No repeated API calls** - Save time and bandwidth
- ✅ **Portability** - Share dataset easily
- ✅ **Backup** - Preserve your data
- ✅ **Quick access** - Load instantly for analysis
- ✅ **Version control** - Track dataset changes

---

In [20]:
df.to_csv('movies.csv')

In [21]:
df.shape

(8560, 6)

In [22]:
df = pd.read_csv('movies.csv', sep=',')
df.head()

,Unnamed: 0,id,title,popularity,release_date,vote_average,vote_count
0,0,278,The Shawshank Redemption,34.9294,1994-09-23,8.713,29053
1,1,238,The Godfather,32.1983,1972-03-14,8.7,21954
2,2,240,The Godfather Part II,19.114,1974-12-20,8.572,13263
3,3,424,Schindler's List,17.1857,1993-12-15,8.567,16785
4,4,389,12 Angry Men,13.2914,1957-04-10,8.5,9468
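
The `Unnamed: 0` column in this output is the old DataFrame index: `df.to_csv('movies.csv')` wrote it as an unnamed first column, and `pd.read_csv` read it back as ordinary data. The sketch below reproduces the effect in memory (`io.StringIO` is used here only to avoid touching the disk) and shows two ways around it.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "id": [278, 238],
    "title": ["The Shawshank Redemption", "The Godfather"],
})

# 1) Default to_csv writes the index, which comes back as "Unnamed: 0"
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(with_index.columns.tolist())   # ['Unnamed: 0', 'id', 'title']

# 2) Fix at write time: leave the index out of the file
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(clean.columns.tolist())        # ['id', 'title']

# 3) Fix at read time: treat the first column as the index
restored = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
print(restored.columns.tolist())     # ['id', 'title']
```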
