Q1. Are there any inconsistent or incorrect data entries that need to be corrected or standardized?

In [None]:
import pandas as pd
import numpy as np


In [None]:
df = pd.read_csv('tmdb_5000_movies.csv')

In [None]:
missing_values = df.isnull().sum()
print(missing_values)

In [None]:
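numeric_columns = ['budget', 'revenue', 'runtime', 'vote_count', 'vote_average']
for column in numeric_columns:
    # Remember which entries were already missing before the conversion
    originally_missing = df[column].isnull()
    df[column] = pd.to_numeric(df[column], errors='coerce')
    # Keep only values that became NaN because they could not be parsed as numbers
    incorrect_values = df[df[column].isnull() & ~originally_missing]
    print(f"Incorrect values in {column}:")
    print(incorrect_values)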


In [None]:
categorical_columns = ['original_language', 'status']
for column in categorical_columns:
    # Flag entries that are missing, empty, or contain only whitespace
    incorrect_values = df[df[column].isnull() | (df[column].astype(str).str.strip() == '')]
    print(f"Incorrect values in {column}:")
    print(incorrect_values)

In [None]:
# Identify rows with at least one missing value
invalid_rows = df[df.isnull().any(axis=1)]
print("Rows with missing values:")
print(invalid_rows)

# Identify rows with duplicate entries
duplicate_rows = df[df.duplicated()]
print("Duplicate rows:")
print(duplicate_rows)

These steps help identify missing values, incorrect values in specific columns, and duplicate rows across the entire dataset. We can then apply appropriate data cleaning or standardization techniques based on the specific issues we identify.
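As an illustration of the standardization step, the following is a minimal sketch rather than a definitive cleaning routine. The function name standardize_movies, the rule that a budget or revenue of 0 means "unknown", and the toy values are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def standardize_movies(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the frame; the cleaning rules are assumptions."""
    cleaned = frame.copy()
    # Trim stray whitespace and unify letter case in the text columns
    cleaned['original_language'] = cleaned['original_language'].str.strip().str.lower()
    cleaned['status'] = cleaned['status'].str.strip().str.title()
    # Assumption: a budget or revenue of 0 means "unknown" rather than a real amount
    for column in ['budget', 'revenue']:
        cleaned[column] = cleaned[column].replace(0, np.nan)
    # Drop rows that are exact duplicates of an earlier row
    return cleaned.drop_duplicates().reset_index(drop=True)


# Toy frame with made-up values, used only to exercise the function
toy = pd.DataFrame({
    'original_language': [' EN', 'en', 'fr', 'fr'],
    'status': ['released', 'Released ', 'Released', 'Released'],
    'budget': [100, 0, 50, 50],
    'revenue': [300, 20, 0, 0],
})
cleaned_toy = standardize_movies(toy)
print(cleaned_toy)
```

On the real data the same function could be called as standardize_movies(df), after checking that the zero-means-unknown assumption actually holds for the columns involved.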

Q2. How can we visualize the distribution of movie revenue and runtime in the dataset?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a pandas DataFrame
df = pd.read_csv('tmdb_5000_movies.csv')

# Plotting the distribution of movie revenue

plt.figure(figsize=(10, 6))
plt.hist(df['revenue'], bins=50, color='skyblue')
plt.title('Distribution of Movie Revenue')
plt.xlabel('Revenue')
plt.ylabel('Count')
plt.show()

# Plotting the distribution of movie runtime

plt.figure(figsize=(10, 6))
plt.hist(df['runtime'].dropna(), bins=50, color='lightgreen')
plt.title('Distribution of Movie Runtime')
plt.xlabel('Runtime')
plt.ylabel('Count')
plt.show()

In this code, we first load the dataset using pd.read_csv() and assign it to the DataFrame df. Then we use plt.hist() to create histograms for the 'revenue' and 'runtime' columns; dropna() removes missing runtime values before plotting. The bins parameter sets the number of bins (bars) in each histogram. Finally, we use plt.title(), plt.xlabel(), and plt.ylabel() to set the plot title, x-label, and y-label, respectively. Calling plt.show() displays each plot.
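If revenue turns out to be strongly right-skewed, as monetary columns often are, most movies will land in the first few bins of a linear histogram. One possible variant, sketched here under that assumption, bins log1p(revenue) instead. The helper name log_histogram and the toy figures are made up for illustration.

```python
import numpy as np
import pandas as pd


def log_histogram(values: pd.Series, bins: int = 50):
    """Bin log1p-transformed values and return (counts, bin_edges)."""
    clean = values.dropna()
    # log1p(x) = log(1 + x), so zero values map to 0 instead of -inf
    return np.histogram(np.log1p(clean), bins=bins)


# Made-up revenue figures spanning several orders of magnitude
toy_revenue = pd.Series([0, 1_000, 50_000, 2_000_000, 900_000_000, np.nan])
counts, edges = log_histogram(toy_revenue, bins=5)
print(counts)
print(edges)
```

The transformed values, np.log1p(df['revenue']), can also be passed to plt.hist() in place of the raw column to draw the same histogram.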

Q3. Can we create visualizations to understand the relationship between variables (popularity, budget)?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a pandas DataFrame
df = pd.read_csv('tmdb_5000_movies.csv')

# Create a scatter plot to visualize the relationship between popularity and budget

plt.figure(figsize=(10, 6))
plt.scatter(df['popularity'], df['budget'], color='orange', alpha=0.6)
plt.title('Relationship between Popularity and Budget')
plt.xlabel('Popularity')
plt.ylabel('Budget')
plt.show()

In this code, we first load the dataset using pd.read_csv() and assign it to the DataFrame df. Then we use plt.scatter() to create a scatter plot with 'popularity' on the x-axis and 'budget' on the y-axis. The color parameter sets the color of the data points, and alpha determines the transparency of the points. We also set the plot title, x-label, and y-label using plt.title(), plt.xlabel(), and plt.ylabel() respectively. Finally, calling plt.show() displays the scatter plot.


Q4. How can we visualize the correlation between the features vote average and vote count?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a pandas DataFrame
df = pd.read_csv('tmdb_5000_movies.csv')

# Create a scatter plot to visualize the correlation between vote_average and vote_count

plt.figure(figsize=(10, 6))
plt.scatter(df['vote_average'], df['vote_count'], color='purple', alpha=0.6)
plt.title('Correlation between Vote Average and Vote Count')
plt.xlabel('Vote Average')
plt.ylabel('Vote Count')
plt.show()

In this code, we first load the dataset using pd.read_csv() and assign it to the DataFrame df. Then we use plt.scatter() to create a scatter plot with 'vote_average' on the x-axis and 'vote_count' on the y-axis. The color parameter sets the color of the data points, and alpha determines the transparency of the points. We also set the plot title, x-label, and y-label using plt.title(), plt.xlabel(), and plt.ylabel() respectively. Finally, calling plt.show() displays the scatter plot.
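A scatter plot shows the relationship visually but does not quantify it. As a complement, the sketch below computes the Pearson correlation and a rank-based (Spearman) correlation for the two columns. The function name vote_correlations and the toy values are assumptions for illustration only.

```python
import pandas as pd


def vote_correlations(frame: pd.DataFrame) -> dict:
    """Pearson and rank-based (Spearman) correlation of the two vote columns."""
    pair = frame[['vote_average', 'vote_count']].dropna()
    pearson = pair['vote_average'].corr(pair['vote_count'])
    # Spearman correlation is the Pearson correlation of the ranks
    spearman = pair['vote_average'].rank().corr(pair['vote_count'].rank())
    return {'pearson': pearson, 'spearman': spearman}


# Made-up values: vote_count grows monotonically, but not linearly, with vote_average
toy = pd.DataFrame({
    'vote_average': [5.0, 6.0, 7.0, 8.0],
    'vote_count': [10, 100, 1000, 10000],
})
result = vote_correlations(toy)
print(result)
```

Calling vote_correlations(df) on the loaded dataset would give the corresponding coefficients for the real data. A Spearman value well above the Pearson value would suggest a monotonic but non-linear relationship.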

Q5. What are the key insights you can derive from the dataset in terms of its structure, size, and basic statistics? Create a summary report (hint: use the pandas .describe() function). Also share the insights you learned from this report.

In [None]:
import pandas as pd

# Load the dataset into a pandas DataFrame
df = pd.read_csv('tmdb_5000_movies.csv')

# Summary report using describe()
summary_report = df.describe(include='all')

# Print the summary report
print(summary_report)

In this code, we load the dataset using pd.read_csv() and assign it to the DataFrame df. Then we use the describe() function on the DataFrame to generate the summary report. The include='all' parameter ensures that the report includes statistics for all columns, including categorical variables. The summary report will provide basic statistics such as count, mean, standard deviation, minimum, quartiles, and maximum for each numerical column, as well as the unique count and most frequent value for categorical columns.

By analyzing the summary report, you can derive several key insights about the dataset:

Structure: The report has one column per dataset column and lists basic statistics for both numerical and categorical variables. It does not show data types or dimensions directly; df.dtypes, df.info(), and df.shape provide those.

Size: The number of columns in the report matches the number of columns in the dataset, and the count for any column without missing values equals the number of rows. df.shape reports both exactly.

Basic statistics: The summary report provides key statistical measures such as count, mean, standard deviation, minimum, quartiles, and maximum for numerical columns. These statistics help in understanding the central tendency, variation, and range of the data.

Missing values: The count statistic in the summary report reveals whether there are missing values in the dataset. Columns with missing values have a count lower than the total number of rows.

Distribution: The summary report gives an overview of the distribution of numerical variables by providing quartiles, which can help identify outliers or skewed distributions.

Categorical variables: For categorical columns, the summary report provides the unique count and the most frequent value, which give insight into the variety and the dominant categories present in the dataset.

By examining the summary report, you can gain a better understanding of the dataset's structure, size, and basic statistics, which can guide further analysis and decision-making in your data exploration process.
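The dimensions, data types, and missing-value counts can also be read directly from the DataFrame. The following is a minimal sketch of such a per-column report. The helper name structure_report and the toy frame are assumptions made for this example.

```python
import pandas as pd


def structure_report(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, non-null count, missing count, distinct values."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'non_null': frame.count(),
        'missing': frame.isnull().sum(),
        'unique': frame.nunique(),
    })


# Toy frame with made-up values, used only to exercise the function
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'runtime': [90.0, None, 120.0],
})
print(toy.shape)  # (number of rows, number of columns)
report = structure_report(toy)
print(report)
```

On the real data, df.shape and structure_report(df) would be used in the same way alongside df.describe(include='all').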

TASK 2 - Classification/Regression

To determine whether the problem requires classification or regression analysis, consider the nature of the target variable. For the "tmdb_5000_movies.csv" dataset, the variable you choose to predict dictates whether classification or regression techniques are appropriate.
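One rough way to check the nature of a candidate target column is to inspect its dtype and its number of distinct values. The heuristic below is only a sketch: the function name suggest_task, the max_classes threshold of 20, and the toy columns (including the hypothetical is_hit label) are assumptions, and the final choice still depends on the question being asked.

```python
import pandas as pd


def suggest_task(target: pd.Series, max_classes: int = 20) -> str:
    """Rough heuristic: non-numeric or low-cardinality integer targets suggest classification."""
    values = target.dropna()
    is_numeric = pd.api.types.is_numeric_dtype(values)
    is_bool = pd.api.types.is_bool_dtype(values)
    if is_bool or not is_numeric:
        return 'classification'
    # Integer columns with only a handful of distinct values act like class labels
    if pd.api.types.is_integer_dtype(values) and values.nunique() <= max_classes:
        return 'classification'
    return 'regression'


# Toy frame with made-up values; 'is_hit' is a hypothetical label column
toy = pd.DataFrame({
    'status': ['Released', 'Released', 'In Production'],
    'revenue': [1.5e6, 2.25e7, 3.1e8],
    'is_hit': [0, 1, 1],
})
for column in toy.columns:
    print(column, '->', suggest_task(toy[column]))
```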

Here are a few steps to help you determine the appropriate analysis technique:

Identify the target variable: In this dataset, the target variable should be the variable you want to predict or analyze. Take a look at the available columns and decide which variable is the focus of your analysis.

Determine the type of the target variable: If the target variable is categorical or discrete and represents different classes or categories, such as genre or movie status (e.g., released, in production), then the problem requires classification analysis. If the target variable is continuous and represents a quantity that can take any value within a range, such as movie revenue or popularity, then the problem requires regression analysis.

Choose the appropriate analysis technique: For classification problems, you can use techniques such as logistic regression, decision trees, random forests, support vector machines (SVM), or naive Bayes classifiers. For regression problems, you can use techniques such as linear regression, decision trees, random forests, support vector regression (SVR), or gradient boosting regressors.

Once you have determined the appropriate analysis technique, you can implement it using Python. Here's an example of how to perform classification or regression analysis using logistic regression and linear regression, respectively:

Classification analysis example using logistic regression:


In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load the dataset into a pandas DataFrame
df = pd.read_csv('tmdb_5000_movies.csv')

# Define the features and target variable
# 'feature1', 'feature2', and 'target_variable' are placeholders, not real column names
X = df[['feature1', 'feature2']]  # Replace with the relevant feature columns
y = df['target_variable']  # Replace with the target column

# Instantiate and fit the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Make predictions
predictions = model.predict(X)

Regression analysis example using linear regression:

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the dataset into a pandas DataFrame
df = pd.read_csv('tmdb_5000_movies.csv')

# Define the features and target variable

# 'feature1', 'feature2', and 'target_variable' are placeholders, not real column names
X = df[['feature1', 'feature2']]  # Replace with the relevant feature columns
y = df['target_variable']  # Replace with the target column

# Instantiate and fit the linear regression model

model = LinearRegression()
model.fit(X, y)

# Make predictions

predictions = model.predict(X)
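Both examples above predict on the same rows the model was fitted on, which tends to overstate how well the model performs. A common alternative is to hold out part of the data for evaluation. The sketch below shows this for both model types on synthetic stand-in data. The helper name holdout_score, the column choices, and the hypothetical is_hit label are assumptions, not a recommended feature set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split


def holdout_score(model, X, y, test_size=0.25, seed=0):
    """Fit on a training split and return the score on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model.fit(X_train, y_train)
    # score() is R^2 for regressors and accuracy for classifiers
    return model.score(X_test, y_test)


# Synthetic stand-in data; values and relationships are made up for illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'budget': rng.uniform(1, 100, size=200),
    'runtime': rng.uniform(80, 180, size=200),
})
toy['revenue'] = 3.0 * toy['budget'] + rng.normal(0, 5, size=200)
# Hypothetical binary label: revenue above the median
toy['is_hit'] = (toy['revenue'] > toy['revenue'].median()).astype(int)

features = toy[['budget', 'runtime']]
r2 = holdout_score(LinearRegression(), features, toy['revenue'])
accuracy = holdout_score(LogisticRegression(max_iter=1000), features, toy['is_hit'])
print(f"R^2: {r2:.3f}, accuracy: {accuracy:.3f}")
```

With the real dataset, rows with missing values in the chosen feature columns would need to be dropped or imputed before calling holdout_score, since both models reject NaN inputs.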