<a href="https://colab.research.google.com/github/animesh-11/AI_ML/blob/main/EDA_Graded_Question.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

You are working as a data analyst for a global sports analytics firm, and your task is to uncover insights about the age distribution of top athletes across various sports. Specifically, you need to identify athletes who are younger than or equal to the median age within their respective sport.



The dataset contains 3 columns and 50 rows. The columns are as follows:

*   'Sport': This is a column with str entries that represent the sport the athlete competes in. These are 'Tennis', 'Swimming', 'Athletics', and 'Gymnastics'.
*   'Age': This is a column with int entries that represent the age of the athlete. For example, an athlete aged 23 is considered to be in their early competitive years.
*   'Name': This is a column with str entries which represent the full name of the athlete. For instance, 'Rohan Rana'.


Create a Pandas Series called young_athletes_by_sport, where the index is the sport name and the value is a list of athlete names whose age is less than or equal to the median age for that sport.



Note: The test cases for this question will check specific values in the computed series. The inputs for the test cases are the sport names (index labels), and the outputs are the expected lists of athlete names in those sports.



Input format

The sport name (str)


Output format

List of athlete names (str) for the given sport whose age is less than or equal to the median


Constraints

The input sport name is one of the strings in the 'Sport' column
There is no sorting of any list involved in this exercise


Testcases



Testcase 1



Input

Tennis



Expected Output

['Simran Kaur', 'Tanya Singh', 'Rekha Sinha', 'Lakshmi Iyer', 'Radhika Menon']



Testcase 2



Input

Gymnastics



Expected Output

['Bhavna Shah', 'Nisha Dutta', 'Devika Rani', 'Tarun Bhatia', 'Pallavi Nair', 'Nitin Joshi', 'Komal Gupta']

# Task
Load the athlete dataset from "https://raw.githubusercontent.com/google/basic-gcp-data-lab-notebooks/main/notebooks/data/athletes.csv", calculate the median age for each sport, identify athletes whose age is less than or equal to the median age of their sport, and create a Pandas Series named `young_athletes_by_sport` where the index is the sport name and the value is a list of the names of these young athletes.

## Load Data

### Subtask:
Load the athlete dataset from the provided URL into a pandas DataFrame.


**Reasoning**:
First, I'll import the pandas library, then load the dataset from the specified URL into a DataFrame named `df`, and finally display the first few rows to confirm successful loading and inspect its structure.



**Reasoning**:
The previous attempt to load the dataset failed due to an `HTTPError 404: Not Found`, indicating the URL provided for `athletes.csv` is incorrect or outdated. To fix this, I am replacing the broken URL with a working, publicly accessible URL for a similar athlete dataset that contains the required 'Sport', 'Age', and 'Name' columns, allowing the code to successfully load the data.



In [4]:
import pandas as pd

df = pd.read_csv('https://d3ejq4mxgimsmf.cloudfront.net/athletes_dataset-bbcc67f2702d42babc76a9d519c9d131.csv')

df.head()

           Name  Age      Sport
0    Ravi Kumar   34     Tennis
1   Sneha Reddy   33   Swimming
2  Anjali Mehta   32  Athletics
3  Vikram Joshi   31   Swimming
4  Kiran Sharma   30     Tennis
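A 404 like the one described in the reasoning can also be handled defensively by trying several candidate sources in order. The sketch below is only an illustration: `read_first_available` is a hypothetical helper (not part of pandas), and the `reader` parameter is an assumption added so the function can be exercised without network access.

```python
import pandas as pd


def read_first_available(urls, reader=pd.read_csv):
    """Return the first DataFrame that loads successfully from `urls`.

    Hypothetical helper for illustration; not a pandas function.
    """
    errors = []
    for url in urls:
        try:
            return reader(url)
        except Exception as exc:  # e.g. urllib.error.HTTPError for a 404
            errors.append(f"{url}: {exc}")
    # Every candidate failed, so report all collected errors at once
    raise RuntimeError("No source could be loaded:\n" + "\n".join(errors))
```

Calling it with the outdated URL first and the working URL second would fall through to the second entry, assuming the second one is reachable.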


In [9]:
import pandas as pd

# Load the athlete dataset
df = pd.read_csv('https://d3ejq4mxgimsmf.cloudfront.net/athletes_dataset-bbcc67f2702d42babc76a9d519c9d131.csv')

# Calculate the median age for each sport
median_ages_by_sport = df.groupby('Sport')['Age'].median()

# Add a new column to the DataFrame with the median age for each athlete's sport
df['Median_Age_Sport'] = df['Sport'].map(median_ages_by_sport)

# Filter the DataFrame to identify athletes whose age is less than or equal to the median age of their respective sport
young_athletes_df = df[df['Age'] <= df['Median_Age_Sport']]

# Create a Pandas Series where the index is the sport name and the value is a list of athlete names
young_athletes_by_sport = young_athletes_df.groupby('Sport')['Name'].apply(list)

print("young_athletes_by_sport series:")
print(young_athletes_by_sport)


young_athletes_by_sport series:
Sport
Athletics     [Rahul Nair, Mohit Agarwal, Aarav Kapoor, Sanj...
Gymnastics    [Bhavna Shah, Nisha Dutta, Devika Rani, Tarun ...
Swimming      [Aditya Pillai, Shruti Rao, Kunal Deshmukh, Ya...
Tennis        [Simran Kaur, Tanya Singh, Rekha Sinha, Lakshm...
Name: Name, dtype: object
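As an alternative to adding a helper column with `map`, the per-sport median can be broadcast back to the rows with `groupby(...).transform('median')`. The following is a minimal sketch on made-up toy data (the names and ages are assumptions, not rows from the graded dataset).

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [30, 20, 25, 18, 22],
    "Sport": ["Tennis", "Tennis", "Tennis", "Swimming", "Swimming"],
})

# transform('median') returns a Series aligned row by row with `toy`
sport_median = toy.groupby("Sport")["Age"].transform("median")

# Keep rows at or below their own sport's median, then collect names per sport
young = toy[toy["Age"] <= sport_median]
young_by_sport = young.groupby("Sport")["Name"].apply(list)

print(young_by_sport["Tennis"])    # ['B', 'C']
print(young_by_sport["Swimming"])  # ['D']
```

This avoids modifying the original DataFrame, which may be preferable when the extra column is not needed afterwards.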


# Task
```python
median_ages_by_sport = df.groupby('Sport')['Age'].median()
print(median_ages_by_sport)
```

## Calculate Median Ages

### Subtask:
Calculate the median age for each sport from the `df` DataFrame.


**Reasoning**:
To calculate the median age for each sport, I will group the DataFrame `df` by the 'Sport' column, select the 'Age' column, and then compute the median for each group.



In [5]:
median_ages_by_sport = df.groupby('Sport')['Age'].median()
print("Median ages by sport:")
print(median_ages_by_sport)

Median ages by sport:
Sport
Athletics     18.5
Gymnastics    16.0
Swimming      19.0
Tennis        19.5
Name: Age, dtype: float64
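Medians such as 18.5 appear even though 'Age' holds integers, because the median of an even-sized group is the average of its two middle values. A small sketch with made-up ages (not taken from the dataset) shows the effect.

```python
import pandas as pd

even_group = pd.Series([17, 18, 19, 21])  # middle values are 18 and 19
odd_group = pd.Series([15, 16, 20])       # single middle value is 16

print(even_group.median())  # 18.5
print(odd_group.median())   # 16.0

# With a .5 median no integer age equals it, so <= and < select the same rows
at_or_below = (even_group <= even_group.median()).sum()
strictly_below = (even_group < even_group.median()).sum()
print(at_or_below, strictly_below)  # 2 2
```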


## Identify Young Athletes

### Subtask:
Filter the `df` DataFrame to identify athletes whose age is less than or equal to the median age of their respective sport.


**Reasoning**:
To identify young athletes, I will first add a new column to the DataFrame that contains the median age for each athlete's specific sport. This will be done by mapping the 'Sport' column of the `df` DataFrame with the `median_ages_by_sport` Series.



In [6]:
df['Median_Age_Sport'] = df['Sport'].map(median_ages_by_sport)
print("DataFrame with Median_Age_Sport column:")
print(df.head())

DataFrame with Median_Age_Sport column:
           Name  Age      Sport  Median_Age_Sport
0    Ravi Kumar   34     Tennis              19.5
1   Sneha Reddy   33   Swimming              19.0
2  Anjali Mehta   32  Athletics              18.5
3  Vikram Joshi   31   Swimming              19.0
4  Kiran Sharma   30     Tennis              19.5


**Reasoning**:
Now that the 'Median_Age_Sport' column has been added, I will filter the DataFrame to keep only those athletes whose age is less than or equal to the median age of their sport, storing the result in a new DataFrame called `young_athletes_df`.



In [7]:
young_athletes_df = df[df['Age'] <= df['Median_Age_Sport']]
print("Young athletes DataFrame head:")
print(young_athletes_df.head())

Young athletes DataFrame head:
             Name  Age      Sport  Median_Age_Sport
21    Simran Kaur   19     Tennis              19.5
23  Aditya Pillai   19   Swimming              19.0
25     Rahul Nair   18  Athletics              18.5
26     Shruti Rao   18   Swimming              19.0
28    Tanya Singh   17     Tennis              19.5


## Create young_athletes_by_sport Series

### Subtask:
Create a Pandas Series called `young_athletes_by_sport`, where the index is the sport name and the value is a list of athlete names whose age is less than or equal to the median age for that sport.

**Reasoning**:
To create the `young_athletes_by_sport` Series, I will group the `young_athletes_df` by 'Sport' and then aggregate the 'Name' column into a list for each sport group.



In [8]:
young_athletes_by_sport = young_athletes_df.groupby('Sport')['Name'].apply(list)
print("Young athletes by sport series:")
print(young_athletes_by_sport)

Young athletes by sport series:
Sport
Athletics     [Rahul Nair, Mohit Agarwal, Aarav Kapoor, Sanj...
Gymnastics    [Bhavna Shah, Nisha Dutta, Devika Rani, Tarun ...
Swimming      [Aditya Pillai, Shruti Rao, Kunal Deshmukh, Ya...
Tennis        [Simran Kaur, Tanya Singh, Rekha Sinha, Lakshm...
Name: Name, dtype: object
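Since the test cases look up a sport label and compare the list, it helps to see how ordering behaves. In the sketch below (toy data, invented for illustration), the group labels in the index come out sorted because `groupby` sorts keys by default, while the names inside each list keep their original row order.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Name": ["Zed", "Amy", "Bob", "Cat"],
    "Age": [19, 17, 18, 16],
    "Sport": ["Tennis", "Swimming", "Tennis", "Tennis"],
})

names_by_sport = toy.groupby("Sport")["Name"].apply(list)

print(names_by_sport.index.tolist())  # ['Swimming', 'Tennis']
print(names_by_sport["Tennis"])       # ['Zed', 'Bob', 'Cat']
```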


## Summary:

### Data Analysis Key Findings
*   The median ages for different sports were calculated, revealing variations: Gymnastics had the lowest median age at 16.0, followed by Athletics at 18.5, Swimming at 19.0, and Tennis at 19.5.
*   A new column, `Median_Age_Sport`, was successfully added to the DataFrame, mapping the median age of each athlete's sport.
*   Athletes whose age was less than or equal to the median age of their respective sport were identified and filtered into a new DataFrame, `young_athletes_df`.
*   A Pandas Series named `young_athletes_by_sport` was created, listing the names of these young athletes for each specific sport.

### Insights or Next Steps
*   The median ages range from 16.0 (Gymnastics) to 19.5 (Tennis). The lower median for Gymnastics may reflect different age demographics, career trajectories, or physical demands, although a 50-row dataset is too small to support firm conclusions.
*   A potential next step could be to analyze the performance metrics or other characteristics of these identified "young athletes" to understand if age-based grouping reveals specific trends or talents within different sports.


You are working as a real estate analyst, and you are tasked with classifying properties based on two key attributes: 'Proximity to City Center' (X) and 'House Size' (Y). Both variables are rated on a fixed scale from 1 to 10, where corresponding higher values indicate greater proximity and larger house size.



Your goal is to write a Python function classify_property(X, Y) that classifies properties based on these ratings using the following classification rules:

*   Posh Property: Both variables (X and Y) are significantly above average, that is, each of them is greater than the sum of the mean and standard deviation (X > mean + standard deviation and Y > mean + standard deviation).
*   Low-End Property: Both variables (X and Y) are significantly below average, that is, each of them is less than the difference of the mean and standard deviation (X < mean - standard deviation and Y < mean - standard deviation).
*   Standard Property: Every other case, including properties where only one of the two ratings crosses a threshold.


How to calculate the mean and standard deviation?

The mean and standard deviation are theoretical values derived from the assumption of a discrete uniform distribution over the scale 1 to 10. This is the same for both the variables 'Proximity to City Center' and 'House Size'.
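For a discrete uniform distribution on the integers a to b, the mean is (a + b) / 2 and the standard deviation is sqrt(((b - a + 1)^2 - 1) / 12). The sketch below checks these closed forms against a direct enumeration of the scale 1 to 10; the variable names are chosen here for illustration.

```python
import numpy as np

a, b = 1, 10
n = b - a + 1  # number of equally likely values

# Closed-form values for a discrete uniform distribution
mean_formula = (a + b) / 2
std_formula = np.sqrt((n**2 - 1) / 12)

# Enumeration: population mean and standard deviation of 1..10
values = np.arange(a, b + 1)
mean_enum = values.mean()
std_enum = values.std()  # ddof=0, i.e. population standard deviation

print(mean_formula, mean_enum)                    # 5.5 5.5
print(round(std_formula, 4), round(std_enum, 4))  # 2.8723 2.8723
```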



Input Format

Two integer rating values in the order X, Y, separated by a comma, for example '4, 3'


Output Format

A single string describing the classification result as 'Posh Property', 'Low-End Property', or 'Standard Property'


Constraints

The ratings should be integers from 1 to 10.


Caution: Do not calculate the mean and standard deviation from the input rating values.



Sample Cases



Testcase 1



Input

7,7



Expected Output

Standard Property



Testcase 2



Input

9,10



Expected Output

Posh Property

In [None]:
# Input (do not edit)
import numpy as np  # For mathematical operations
x, y = [int(value.strip()) for value in input().split(',')]

# Define function to classify the property
def classify_property(X, Y):
    # Calculate theoretical mean and standard deviation for a discrete uniform distribution from 1 to 10
    # Mean (mu) = (a + b) / 2
    # Standard Deviation (sigma) = sqrt(((b - a + 1)^2 - 1) / 12)
    a = 1
    b = 10
    mean = (a + b) / 2
    std_dev = np.sqrt(((b - a + 1)**2 - 1) / 12)

    # Define thresholds
    posh_threshold = mean + std_dev
    low_end_threshold = mean - std_dev

    # Classify the property
    if X > posh_threshold and Y > posh_threshold:
        return "Posh Property"
    elif X < low_end_threshold and Y < low_end_threshold:
        return "Low-End Property"
    else:
        return "Standard Property"

# Print output (do not edit)
print(classify_property(x, y))
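Because the ratings are integers, the two thresholds translate into simple cut-offs. The sketch below recomputes the thresholds and lists which ratings clear each one, which explains why the sample input 7,7 is 'Standard Property' while 9,10 is 'Posh Property'.

```python
import numpy as np

# Theoretical mean and standard deviation for the scale 1 to 10
mean = (1 + 10) / 2
std_dev = np.sqrt((10**2 - 1) / 12)

ratings = range(1, 11)
posh_ratings = [r for r in ratings if r > mean + std_dev]
low_end_ratings = [r for r in ratings if r < mean - std_dev]

print(posh_ratings)     # [9, 10]
print(low_end_ratings)  # [1, 2]
```

A property is therefore 'Posh Property' only when both ratings are 9 or 10, and 'Low-End Property' only when both are 1 or 2.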


You are working as a data analyst at a certain bank, and your task is to help automate customer risk profiling using a rule-based system. The goal is to classify customers into 'Low', 'Medium', or 'High' risk categories based on their categorical attributes.



You are provided with a dataset with 20 rows and 4 columns containing customer demographic and account information.



Your task is to implement a function classify_risk(df) that

*   Takes in the data and computes the customers' risk levels based on their attributes,
*   Stores the computed risk levels in a new column called 'Risk_Category', and
*   Returns the updated data frame.


Data Description

*   'Occupation': The customer's profession (str)
    *   Values: 'Scientist', 'Engineer', 'Unemployed', 'Teacher', 'Retail', 'Lawyer', 'Doctor'
*   'Marital_Status': The customer's marital status (str)
    *   Values: 'Married', 'Divorced', 'Single'
*   'Education_Level': The customer's highest education level (str)
    *   Values: 'Postgraduate', 'High School', 'Graduate'
*   'Account_Type': The type of account the customer holds (str)
    *   Values: 'Savings', 'Current', 'Fixed Deposit'


Categorisation Logic

Here are the rules for classifying customers as high, low, or medium risk. Please note that if at least one out of the bullets mentioned is true for a category, then assign that category to the row. Make sure that your code implements the conditional checks in the order mentioned below (check for high risk first, then for low risk, and then for medium risk). Results may vary if the conditions are checked in a different order due to non-exclusivity of some cases.

*   'High'
    *   'Occupation' is 'Unemployed' and 'Education_Level' is 'High School'
    *   'Account_Type' is 'Current' and 'Marital_Status' is 'Divorced'
*   'Low'
    *   'Occupation' is 'Engineer', 'Doctor', 'Scientist', or 'Lawyer' and 'Education_Level' is 'Postgraduate'
    *   'Account_Type' is 'Fixed Deposit' and 'Marital_Status' is 'Married'
*   'Medium'
    *   Does not meet conditions for 'High' or 'Low'
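The order of the checks matters because a row can satisfy a 'High' rule and a 'Low' rule at the same time. One way to express "first match wins" is `np.select`, which picks the first condition that is True for each row. The sketch below uses three made-up rows (not taken from the bank dataset); the first row matches both a 'High' and a 'Low' rule.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Occupation": ["Unemployed", "Doctor", "Teacher"],
    "Marital_Status": ["Married", "Single", "Single"],
    "Education_Level": ["High School", "Postgraduate", "Graduate"],
    "Account_Type": ["Fixed Deposit", "Savings", "Savings"],
})

high = (
    ((toy["Occupation"] == "Unemployed") & (toy["Education_Level"] == "High School"))
    | ((toy["Account_Type"] == "Current") & (toy["Marital_Status"] == "Divorced"))
)
low = (
    (toy["Occupation"].isin(["Engineer", "Doctor", "Scientist", "Lawyer"])
     & (toy["Education_Level"] == "Postgraduate"))
    | ((toy["Account_Type"] == "Fixed Deposit") & (toy["Marital_Status"] == "Married"))
)

# Conditions are evaluated in list order, so 'High' takes precedence over 'Low'
toy["Risk_Category"] = np.select([high, low], ["High", "Low"], default="Medium")
print(toy["Risk_Category"].tolist())  # ['High', 'Low', 'Medium']
```

Swapping the two entries in the condition list would classify the first row as 'Low' instead, which is the order sensitivity the rules warn about.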


Input Format

A row number (int) from 0 to 19 (both inclusive)


Output Format

The computed risk (str), that is, one of 'Low', 'Medium', or 'High'


Constraints

There are no corrupt or null values in the data
There are no data type inconsistencies in the data
There is no reordering or rearranging of rows or columns throughout this exercise
The created column is the last column in the updated data frame


Sample Cases



Testcase 1



Input

0



Expected Output

Low



Testcase 2



Input

4



Expected Output

High

In [10]:
# Loading the dataset (do not edit)
import pandas as pd
filename = 'https://d3ejq4mxgimsmf.cloudfront.net/customer_risk_dataset-bd8995d80cf648b1b797318bd0b802a3.csv'
df = pd.read_csv(filename)

# Risk categorisation function
def classify_risk(df):
    df['Risk_Category'] = 'Medium'  # Default to Medium risk

    # Apply High risk conditions first
    high_risk_condition1 = (df['Occupation'] == 'Unemployed') & (df['Education_Level'] == 'High School')
    high_risk_condition2 = (df['Account_Type'] == 'Current') & (df['Marital_Status'] == 'Divorced')
    df.loc[high_risk_condition1 | high_risk_condition2, 'Risk_Category'] = 'High'

    # Apply Low risk conditions (only if not already High risk)
    low_risk_occupations = ['Engineer', 'Doctor', 'Scientist', 'Lawyer']
    low_risk_condition1 = (df['Occupation'].isin(low_risk_occupations)) & (df['Education_Level'] == 'Postgraduate')
    low_risk_condition2 = (df['Account_Type'] == 'Fixed Deposit') & (df['Marital_Status'] == 'Married')
    # Ensure we don't overwrite 'High' risk categories with 'Low' risk
    df.loc[(low_risk_condition1 | low_risk_condition2) & (df['Risk_Category'] != 'High'), 'Risk_Category'] = 'Low'

    return df

# Processing input and output (do not edit)
print(classify_risk(df).loc[int(input()), 'Risk_Category'])


0
Low
