### Q1. Load the flight price dataset and examine its dimensions. How many rows and columns does the dataset have?



```python
import pandas as pd

# Load the dataset
df = pd.read_csv('flight_price_data.csv')

# Check the dimensions of the dataset
df.shape
```

Output (example):
```
(5000, 10)
```
The dataset has **5,000 rows** and **10 columns**.

---

### Q2. What is the distribution of flight prices in the dataset? Create a histogram to visualize the distribution.

To visualize the distribution of flight prices, plot a histogram. The example below uses seaborn's `histplot` (built on matplotlib) and overlays a kernel density estimate with `kde=True`.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting the distribution of flight prices
plt.figure(figsize=(8, 6))
sns.histplot(df['Price'], bins=30, kde=True)
plt.title('Distribution of Flight Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
```

The histogram shows the shape of the price distribution: whether it is roughly symmetric, skewed toward high or low prices, or has more than one peak.
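
To put a number on the shape seen in the histogram, one option is to compare the mean with the median and compute the sample skewness. The sketch below uses a small made-up `prices` series as a stand-in for `df['Price']`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy prices standing in for df['Price']
prices = pd.Series([100, 120, 130, 150, 160, 180, 200, 900], name="Price")

skewness = prices.skew()        # positive values suggest a long right tail
mean_price = prices.mean()
median_price = prices.median()

print(f"skewness={skewness:.2f}, mean={mean_price:.1f}, median={median_price:.1f}")
```

A mean well above the median, together with positive skewness, suggests a few expensive flights are pulling the average up.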

---

### Q3. What is the range of prices in the dataset? What is the minimum and maximum price?

The range is the difference between the maximum and minimum price. Both values come directly from the `min()` and `max()` methods.

```python
# Get the minimum and maximum price
min_price = df['Price'].min()
max_price = df['Price'].max()

min_price, max_price
```

Output (example):
```
(100, 12000)
```
In this example the minimum price is **100** and the maximum price is **12,000**, which gives a range of **11,900**.
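
The range itself is a single subtraction. A minimal sketch, using a made-up `prices` series in place of `df['Price']`:

```python
import pandas as pd

# Hypothetical toy prices standing in for df['Price']
prices = pd.Series([100, 450, 3200, 12000], name="Price")

# Range = maximum - minimum
price_range = prices.max() - prices.min()
print(price_range)
```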

---

### Q4. How does the price of flights vary by airline? Create a boxplot to compare the prices of different airlines.

A boxplot is a good way to compare prices across different airlines.

```python
# Create a boxplot for prices by airline
plt.figure(figsize=(10, 6))
sns.boxplot(x='Airline', y='Price', data=df)
plt.title('Flight Prices by Airline')
plt.xticks(rotation=90)
plt.show()
```

This boxplot will show how flight prices differ across airlines and will also reveal any patterns, such as some airlines having consistently higher prices than others.
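
A numeric summary can back up what the boxplot shows. The sketch below groups a small made-up table by airline and reports the median, minimum, maximum, and number of flights; the airline names and prices are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the flight dataset
toy = pd.DataFrame({
    "Airline": ["A", "A", "A", "B", "B", "B"],
    "Price": [100, 200, 300, 400, 500, 900],
})

# Per-airline summary, most expensive (by median) first
summary = (
    toy.groupby("Airline")["Price"]
       .agg(["median", "min", "max", "count"])
       .sort_values("median", ascending=False)
)
print(summary)
```

The median is used for ordering because it is less sensitive to a handful of extreme fares than the mean.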

---

### Q5. Are there any outliers in the dataset? Identify any potential outliers using a boxplot and describe how they may impact your analysis.

To identify outliers, draw a boxplot of `Price` on its own, without splitting by airline. Points beyond the whiskers are potential outliers; by default seaborn extends the whiskers to 1.5 times the interquartile range from the box.

```python
# Boxplot for identifying outliers
plt.figure(figsize=(10, 6))
sns.boxplot(y='Price', data=df)
plt.title('Boxplot of Flight Prices (Outliers Identification)')
plt.show()
```

**Potential Impact of Outliers:**
- Outliers, such as extremely high or low prices, could distort the average flight price and may not represent typical trends.
- You might consider treating or removing these outliers if they result from data entry errors or reflect rare events that aren't representative of normal trends.
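
The points a boxplot draws beyond its whiskers can also be listed explicitly with the 1.5 × IQR rule. This is a sketch on a made-up `prices` series standing in for `df['Price']`; the 1.5 multiplier is the conventional default, not a requirement.

```python
import pandas as pd

# Hypothetical toy prices standing in for df['Price']
prices = pd.Series([100, 120, 130, 150, 160, 180, 200, 900], name="Price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```

Listing the flagged rows makes it easier to check whether they are data entry errors or genuine premium fares before deciding to drop them.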

---

### Q6. Identify the peak travel season. What features would you analyze to find the peak season?

To identify the peak travel season, you should analyze features like **departure date**, **booking date**, and **holiday seasons**. You could also look at demand-driven metrics such as **flight bookings per month**.

Steps:
1. **Extract Month/Season** from the departure date:
   ```python
   df['Month'] = pd.to_datetime(df['Departure Date']).dt.month
   ```
2. **Group by Month** and aggregate:
   ```python
   monthly_prices = df.groupby('Month')['Price'].mean()
   monthly_prices.plot(kind='line', figsize=(10, 6))
   plt.title('Average Flight Prices by Month')
   plt.xlabel('Month')
   plt.ylabel('Average Price')
   plt.show()
   ```

3. Analyze public holidays and vacation periods for further insights.

**Presentation to the Boss:**
- Use a line chart to show how prices vary month-to-month.
- Highlight peak periods like summer vacations, Christmas, etc., where prices are higher, indicating higher demand.
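
Price alone is an indirect signal of demand, so it can help to count flights per month alongside the average price. The sketch below does both on a made-up table; the dates and prices are invented, and the `Departure Date` column name follows the snippets above.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the snippets above
toy = pd.DataFrame({
    "Departure Date": ["2023-01-15", "2023-07-01", "2023-07-20",
                       "2023-07-28", "2023-12-24", "2023-12-26"],
    "Price": [300, 650, 700, 720, 800, 780],
})

toy["Month"] = pd.to_datetime(toy["Departure Date"]).dt.month

# Number of flights and average price for each month
by_month = toy.groupby("Month")["Price"].agg(flights="count", avg_price="mean")
busiest_month = by_month["flights"].idxmax()

print(by_month)
print("Busiest month:", busiest_month)
```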

---

### Q7. Identify trends in flight prices. What features would you analyze and what visualizations would you use?

To identify trends, you would analyze the following features:
- **Flight Duration**: Check how the price varies with flight length.
- **Class of Travel**: Economy vs. Business class pricing trends.
- **Booking Time**: Analyze how booking in advance impacts prices.

Steps:
1. **Flight Duration vs. Price**:
   ```python
   sns.scatterplot(x='Duration', y='Price', data=df)
   plt.title('Flight Duration vs. Price')
   plt.show()
   ```

2. **Class of Travel**:
   ```python
   sns.boxplot(x='Class', y='Price', data=df)
   plt.title('Price by Class of Travel')
   plt.show()
   ```

3. **Days Before Departure** (for advance booking):
   ```python
   df['Days Before Departure'] = (pd.to_datetime(df['Departure Date']) - pd.to_datetime(df['Booking Date'])).dt.days
   sns.lineplot(x='Days Before Departure', y='Price', data=df)
   plt.title('Price vs. Days Before Departure')
   plt.show()
   ```

Present the findings with these visualizations to your team, focusing on patterns in pricing and how factors like flight length, travel class, and booking time affect prices.
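
A line plot over every individual day can be noisy, so one alternative is to bucket the lead time into booking windows and compare average prices. The bin edges and labels below are arbitrary choices for illustration, and the data is made up.

```python
import pandas as pd

# Hypothetical toy data; 'Days Before Departure' is the column derived above
toy = pd.DataFrame({
    "Days Before Departure": [1, 3, 10, 12, 25, 40, 55, 80],
    "Price": [900, 850, 600, 620, 450, 400, 380, 360],
})

# Illustrative booking windows (right-inclusive bins)
bins = [0, 7, 14, 30, 60, 365]
labels = ["0-7", "8-14", "15-30", "31-60", "61+"]
toy["Booking Window"] = pd.cut(toy["Days Before Departure"], bins=bins, labels=labels)

# Average price per booking window
window_prices = toy.groupby("Booking Window", observed=True)["Price"].mean()
print(window_prices)
```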

---

### Q8. Identify factors that affect flight prices. What features would you analyze and how would you present your findings?

You could perform an exploratory analysis using the following factors:
- **Airline**
- **Departure and Arrival Airports**
- **Flight Duration**
- **Booking in advance** (days before departure)
- **Travel Class** (Economy vs. Business)

Steps:
1. **Correlation Matrix** to see relationships:
   ```python
   # Correlation is only defined for numeric columns, so select those first
   sns.heatmap(df.select_dtypes('number').corr(), annot=True)
   plt.title('Correlation Between Flight Price and Other Factors')
   plt.show()
   ```

2. **Linear Regression** for multivariate analysis:
   ```python
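   from sklearn.linear_model import LinearRegression

   # 'Days Before Departure' is the column derived in Q7
   features = ['Airline', 'Duration', 'Days Before Departure', 'Class']
   # One-hot encode the categorical columns; drop_first avoids redundant dummies
   X = pd.get_dummies(df[features], drop_first=True)
   y = df['Price']

   model = LinearRegression()
   model.fit(X, y)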
   ```

**Presentation to Management:**
- **Visuals**: Use correlation heatmaps, bar plots, and regression results to explain the impact of different factors on pricing.
- **Conclusion**: Summarize the key factors driving flight prices, such as travel class and booking times, and provide actionable recommendations to optimize pricing strategies.
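
The regression step depends on one-hot encoding the categorical columns, so it is worth seeing what `pd.get_dummies` produces. The sketch below uses a made-up table with placeholder airline and class names; only the column names follow the snippets above.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Airline": ["A", "B", "C", "A"],
    "Class": ["Economy", "Business", "Economy", "Economy"],
    "Duration": [2.0, 5.5, 3.0, 2.5],
})

# drop_first=True drops one level per categorical column (the baseline)
X = pd.get_dummies(toy, drop_first=True)
print(X.columns.tolist())
```

Each remaining dummy column is interpreted relative to the dropped baseline level, here airline `A` and class `Business`.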

---

The remaining questions switch to the Google Playstore dataset and use the same tools: pandas, Matplotlib, and Seaborn.

### Q9. Load the Google Playstore dataset and examine its dimensions. How many rows and columns does the dataset have?

```python
import pandas as pd

# Load the dataset
df = pd.read_csv('GooglePlaystore.csv')

# Check the dimensions of the dataset
df.shape
```

- This will give you the number of rows and columns in the dataset (e.g., 10841 rows and 13 columns).

### Q10. How does the rating of apps vary by category? Create a boxplot to compare the ratings of different app categories.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Create a boxplot to compare ratings by category
plt.figure(figsize=(10,6))
sns.boxplot(x='Category', y='Rating', data=df)
plt.xticks(rotation=90)
plt.title('App Ratings by Category')
plt.show()
```

- The boxplot will help you visualize the distribution of app ratings across different categories, highlighting the variance and potential outliers.

### Q11. Are there any missing values in the dataset? Identify any missing values and describe how they may impact your analysis.

```python
# Check for missing values
df.isnull().sum()
```

- This will show the count of missing values in each column. Missing values, especially in critical columns like `Rating`, can impact your analysis by introducing bias or reducing the dataset size. Depending on the proportion of missing data, you can choose to drop the rows, impute values, or analyze without them.
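
The options above (drop or impute) can be sketched as follows. The table is made up, and median imputation is just one possible choice, not necessarily the right one for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.5, np.nan, 3.8, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Percentage of missing values per column, largest first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Option 1: drop rows where Rating is missing
dropped = toy.dropna(subset=["Rating"])

# Option 2: fill missing ratings with the median rating
imputed = toy.assign(Rating=toy["Rating"].fillna(toy["Rating"].median()))

print(missing_pct)
```

Dropping rows shrinks the sample, while imputing keeps the rows but flattens the variance of `Rating`; which trade-off is acceptable depends on how much data is missing.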

### Q12. What is the relationship between the size of an app and its rating? Create a scatter plot to visualize the relationship.

First, ensure that the `Size` column is in a numeric format.

```python
# Convert 'Size' to a number of megabytes; 'Varies with device' becomes NaN
size = df['Size'].astype(str)
number = pd.to_numeric(size.str.replace(r'[Mk]$', '', regex=True), errors='coerce')
# Values ending in 'k' are kilobytes, so convert them to MB (assuming 1 MB = 1024 kB)
df['Size'] = number.where(~size.str.endswith('k'), number / 1024)

# Create a scatter plot of size vs rating
plt.figure(figsize=(8,6))
sns.scatterplot(x='Size', y='Rating', data=df)
plt.title('App Size vs Rating')
plt.xlabel('Size (MB)')
plt.ylabel('Rating')
plt.show()
```

- The scatter plot will give insights into how app size correlates with its rating. This relationship might not be linear, and other factors might also influence ratings.
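
A correlation coefficient gives a compact summary of what the scatter plot shows. The sketch below computes Pearson's coefficient and a rank-based (Spearman) coefficient on made-up values; because the relationship may not be linear, the rank-based version is often the more informative of the two.

```python
import pandas as pd

# Hypothetical toy data: size in MB and rating, with one missing size
toy = pd.DataFrame({
    "Size": [2.0, 5.0, 10.0, 25.0, 50.0, None],
    "Rating": [3.9, 4.1, 4.0, 4.4, 4.5, 4.2],
})

# Keep only rows where both values are present
pairs = toy.dropna(subset=["Size", "Rating"])

pearson = pairs["Size"].corr(pairs["Rating"])
# Spearman correlation is Pearson correlation computed on ranks
spearman = pairs["Size"].rank().corr(pairs["Rating"].rank())

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```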

### Q13. How does the type of app affect its price? Create a bar chart to compare average prices by app type.

```python
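# 'Price' may be stored as text such as '$4.99', so convert it to a number first
df['Price'] = pd.to_numeric(df['Price'].astype(str).str.replace('$', '', regex=False), errors='coerce')

# Group by Type and calculate the average price
avg_price_by_type = df.groupby('Type')['Price'].mean().reset_index()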

# Plot the bar chart
plt.figure(figsize=(6,4))
sns.barplot(x='Type', y='Price', data=avg_price_by_type)
plt.title('Average Price by App Type')
plt.ylabel('Average Price ($)')
plt.show()
```

- This bar chart will help compare the average prices between free and paid apps.

### Q14. What are the top 10 most popular apps in the dataset? Create a frequency table to identify the apps with the highest number of installs.

```python
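# Convert Installs to numeric by removing commas and the '+' sign;
# regex=False treats '+' literally, and unparseable values become NaN
installs = df['Installs'].astype(str).str.replace(',', '', regex=False).str.replace('+', '', regex=False)
df['Installs'] = pd.to_numeric(installs, errors='coerce')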

# Find the top 10 apps by number of installs
top_10_apps = df.groupby('App')['Installs'].sum().sort_values(ascending=False).head(10)

# Display the top 10 apps
top_10_apps
```

- This returns a table of the 10 apps with the most installs, sorted from highest to lowest.
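
If the same app appears in more than one row, summing `Installs` counts it twice. Whether the real file contains such duplicates is an assumption worth checking; the sketch below shows the difference on a made-up table, using `max()` per app as one way to avoid double counting.

```python
import pandas as pd

# Hypothetical toy data where 'Maps' is listed twice
toy = pd.DataFrame({
    "App": ["Maps", "Maps", "Chat", "Notes"],
    "Installs": [1000, 1000, 5000, 100],
})

# Summing double-counts the duplicated app
summed = toy.groupby("App")["Installs"].sum()

# Taking the maximum per app counts each app once
deduped = toy.groupby("App")["Installs"].max().sort_values(ascending=False)

print(deduped.head(10))
```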

### Q15. A company wants to launch a new app on the Google Playstore and has asked you to analyze the Google Playstore dataset to identify the most popular app categories. How would you approach this task, and what features would you analyze to make recommendations to the company?

**Approach**:
1. **Popularity Analysis**:
   - Use the `Installs` column to calculate the total number of installs by category.
   - Analyze the `Rating` to assess user satisfaction across categories.
   - Look at the `Reviews` column to see the volume of feedback in different categories.
   - Check the average price and app size in each category to understand competitive pricing.

2. **Key Features to Analyze**:
   - **Installs**: Determine the most downloaded app categories.
   - **Rating**: Identify categories with high user satisfaction.
   - **Price**: Compare average pricing models in different categories.
   - **Reviews**: Categories with high review counts might indicate active engagement.

```python
# Group by Category and calculate total installs and average rating
category_analysis = df.groupby('Category').agg({'Installs': 'sum', 'Rating': 'mean'}).reset_index()

# Sort by installs to identify the most popular categories
category_analysis = category_analysis.sort_values(by='Installs', ascending=False)
category_analysis
```

This analysis will give the company insights into which categories have high demand and satisfied users, helping them choose a category for their new app launch.
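
Total installs favor categories that simply contain many apps, so a per-app figure and the review volume can round out the picture. The sketch below extends the aggregation on a made-up table; the category names and numbers are placeholders, and `installs_per_app` is an illustrative metric rather than a column in the dataset.

```python
import pandas as pd

# Hypothetical toy data with numeric Installs and Reviews
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "FAMILY"],
    "Installs": [1_000_000, 500_000, 100_000, 50_000, 10_000],
    "Rating": [4.5, 4.1, 4.0, 3.8, 4.7],
    "Reviews": [20000, 8000, 1500, 700, 300],
})

summary = toy.groupby("Category").agg(
    total_installs=("Installs", "sum"),
    avg_rating=("Rating", "mean"),
    total_reviews=("Reviews", "sum"),
    app_count=("Installs", "count"),
)

# Installs per app indicates demand relative to how crowded a category is
summary["installs_per_app"] = summary["total_installs"] / summary["app_count"]
summary = summary.sort_values("total_installs", ascending=False)
print(summary)
```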