In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re

**Question 1**: 
Using the provided dataset, write code that calculates the correlation matrix for all the numerical features. Then write code that creates a heatmap to visually represent these correlations.

In [None]:
# Given Code
data = {
    "Math_Score": np.random.randint(50, 100, 50),       
    "Science_Score": np.random.randint(50, 100, 50),    
    "English_Score": np.random.randint(50, 100, 50),    
    "Study_Hours": np.random.randint(1, 10, 50),     
    "Sleep_Hours": np.random.randint(4, 10, 50)         
}

df = pd.DataFrame(data)

In [None]:
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap="plasma")
plt.show()

**Question 2:**

In [None]:
employee = {
    'Employee ID': ['1232', '1343', '1453', '1211', '1225', '1777', '1436'],
    'Employee First Name': ['James', 'John', 'Joe', 'Will', 'Mike', 'Tom', 'Mary'],
    'Employee Last Name': ['Smith', 'Jones', 'Miller', 'Jackson', 'Lopez', 'Cooper', 'Brown'],
    'Date Hired': ['10-31-2023','02-28-2024','01-09-2017','02-01-2020','11-30-2022','01-01-2020','12-25-2019'],
    'Employee Type': ['Technician', 'Technician', 'Manager', 'Supervisor', 'Technician', 'Supervisor', 'Manager'],
    'Salary': ['30000', '25000', '60000', '45000', '40000', '45000', '55000']
}
df_empl = pd.DataFrame(employee)
df_empl

a) Using this dataframe, first clean the column names (e.g. `employee_id`) and make sure the data types are appropriate.

In [None]:
#clean the column names
df_empl.columns = df_empl.columns.str.lower().str.replace(' ', '_')

# convert the date and numeric columns, which were entered as strings
df_empl['date_hired'] = pd.to_datetime(df_empl['date_hired'], format='%m-%d-%Y')
df_empl = df_empl.astype({'employee_id': int, 'salary': int})
df_empl.dtypes

b) Using your newly cleaned dataframe visualize the average salary for each employee type.

In [None]:
# barplot aggregates with the mean by default, so each bar is the average salary
sns.barplot(data=df_empl, x='employee_type', y='salary')
plt.show()

c) Now using pandas find the average salary for each employee type.

In [None]:
average_salary = df_empl.groupby('employee_type')['salary'].mean()
average_salary

**Question 3:**

a) Import seaborn's taxis dataset.

In [None]:
taxis = sns.load_dataset('taxis')
taxis.head()

b) What are the dimensions of the taxis dataset? What are the column names and data types? Do any of the columns have null values?

In [None]:
taxis.info()

c) Create a pair plot of the taxis data. 

In [None]:
sns.pairplot(taxis);

d) Create a correlation heat map of the taxis data. Are there any strong correlations between variables in the dataset?

In [None]:
corrmat = taxis.corr(numeric_only=True)
sns.heatmap(data=corrmat, annot=True);

There is a strong positive correlation between:
- total and distance
- total and fare
- total and tip
- total and tolls
- tolls and distance
- tolls and fare
- distance and fare
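
A list like this can also be pulled out of the correlation matrix programmatically rather than read off the heatmap by eye. The cell below is a minimal sketch on a small made-up dataframe: the `toy` data, the `strong_pairs` helper, and the 0.7 cutoff are all assumptions for illustration and are not part of the assignment. The same helper could be pointed at `corrmat`.

```python
import pandas as pd

def strong_pairs(corr, threshold=0.7):
    """Return (col_a, col_b, r) for each unique pair with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        # Start at i + 1 to skip the diagonal and mirrored duplicates
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs

# Made-up data: 'fare' rises with 'distance', 'passengers' is unrelated
toy = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "fare": [5.0, 9.0, 13.5, 17.0, 22.0],
    "passengers": [1, 3, 1, 3, 1],
})
toy_pairs = strong_pairs(toy.corr())
print(toy_pairs)
```

What counts as "strong" is a judgment call, so the cutoff is worth stating alongside the answer.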

**Question 4** 

a) What is an API?

An API (Application Programming Interface) is a mechanism that allows two software components to interact with each other.

b) What is the name of the data returned from an API request, typically in JSON format?

API Payload
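
To make the idea of a payload concrete, the sketch below parses a JSON string into Python objects with the standard-library `json` module. The `raw_body` string is made up for illustration and does not come from any real API; when using `requests`, calling `.json()` on a response performs the equivalent parsing step.

```python
import json

# Hypothetical raw response body, shaped like a typical JSON payload
raw_body = (
    '[{"id": "abc", "url": "https://example.com/cat1.jpg"},'
    ' {"id": "def", "url": "https://example.com/cat2.jpg"}]'
)

# JSON text becomes ordinary Python containers: here, a list of dictionaries
payload = json.loads(raw_body)
print(type(payload).__name__)
print([item["url"] for item in payload])
```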

**Question 5**

a) Import the Excel file "SciDiv_Casper_Fall_24.xlsx" into a pandas dataframe called ```survey_df```. Display the *first 3 rows*.

In [None]:
survey_df = pd.read_excel('SciDiv_Caspar_Fall_24.xlsx')
survey_df.head(3)

b) You will be performing an initial EDA to determine what changes should be made to this dataset to get it into a workable format.

Get the general (i) *information* and (ii) *descriptions* of the dataset attributes and variables. Write what you learned from each in a markdown cell. No more than 3 sentences each.

In [None]:
# i 
survey_df.info()

There are 462 rows and 23 columns. Only the columns Heading.., UPC Meter Mark..., Relief, Count, and Count.1 are floats; the rest are objects. Two pairs of columns share a base name: 'Distance (m)' and 'Distance (m).1', and 'Count' and 'Count.1'.

In [None]:
#ii
survey_df.describe()

As discussed in (i), only 5 columns hold numeric values, so `describe()` can only summarize those. Count and Count.1 are not duplicate columns despite their near-identical names, so they must represent different values.

c) Look at the columns of the dataframe. Which columns represent values that would be better evaluated numerically?

In a markdown cell, list 5 columns that would be better represented as floats or integers.

5 columns that would be better represented as a float or integer are:
1. Start Depth (ft)
2. End Depth (ft)
3. Size (mm)
4. Distance (m)
5. Distance (m).1

d) The Hard Part! *Create a function that will convert the given columns of a dataframe to numerical equivalents using regex. Assign the result to* ```survey_df```.

*Hint 1: Initial set up: ```def function(dataframe, list_of_columns):``` is a sound method to perform this task.*  
*Hint 2: Don't forget to account for numbers with a decimal and numbers without!*  
*Hint 3: Depending on how you build your function, it will either update the dataframe you enter, or create a new one.*

In [None]:
def dataconversion(dataframe_orig, list_of_columns):
    # Work on a copy so the dataframe passed in is left untouched
    dataframe = dataframe_orig.copy()
    # Try a decimal number first, then fall back to a whole number
    num_regex = r'\d+\.\d+|\d+'
    for column in list_of_columns:
        num_string = []
        for i in dataframe[column]:
            if pd.isna(i):
                num_string.append(np.nan)
            elif isinstance(i, (int, float)):
                # Already numeric: store as a float
                num_string.append(float(i))
            else:
                # Keep the first number found in the text, or NaN if there is none
                found = re.findall(num_regex, str(i))
                num_string.append(float(found[0]) if found else np.nan)
        dataframe[column] = num_string
    return dataframe


columns_to_change = ['Visibility', 'Start Depth (ft)', 'End Depth (ft)', 'Size (mm)', 'Distance (m)', 'Distance (m).1']

survey_df = dataconversion(survey_df,columns_to_change)
survey_df
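
The same first-number-per-cell idea can also be written without an explicit loop over rows. The sketch below is one possible alternative, not the required solution: `convert_with_extract` and the `toy_survey` data are hypothetical names made up for illustration. It uses `Series.str.extract` with the same decimal-then-integer pattern and `pd.to_numeric` to turn the matches into floats.

```python
import pandas as pd

def convert_with_extract(dataframe_orig, list_of_columns):
    """Hypothetical alternative: keep the first number found in each cell."""
    dataframe = dataframe_orig.copy()
    for column in list_of_columns:
        # Cast to text so mixed numeric/text columns are handled uniformly
        as_text = dataframe[column].astype(str)
        # The capture group returns the first match, or NaN when nothing matches
        first_number = as_text.str.extract(r"(\d+\.\d+|\d+)", expand=False)
        dataframe[column] = pd.to_numeric(first_number, errors="coerce")
    return dataframe

# Made-up rows covering a whole number, a decimal, a missing value, and plain text
toy_survey = pd.DataFrame({
    "Size (mm)": ["12 mm", "3.5mm", None, "unknown"],
    "Site": ["A", "B", "C", "D"],
})
converted = convert_with_extract(toy_survey, ["Size (mm)"])
print(converted["Size (mm)"].tolist())
```

Because every cell is round-tripped through text, this version suits ordinary values; cells that are already floats in scientific notation would be better passed through unchanged, as the loop-based function does. Like the original pattern, it ignores minus signs.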

**Question 6:** What are the 7 components of a graph and what do they mean?

Data: The data that fills the graph.  
Aesthetic mappings: The variables assigned to visual properties of the graph, such as the axes, color, and size.  
Scales: Control how data values map to aesthetic attributes. They allow for visualization of continuous or categorical values.  
Geometric objects: Control the type of plot you create. This is the type of visualization you use: points, bars, lines, etc.  
Statistical transformations: How the data is summarized in the graph: sum, bin, density, etc.  
Facets: The subplots of your graph. Each subplot can contain a unique chart.  
Coordinate system: The numerical mapping of your data to the axes, such as Cartesian.

**Question 7:**

[The Cat API](https://thecatapi.com/) is a free-to-use API for cat images and other information. Read through the documentation to answer the questions below. You do not need to create an account to answer them. Make sure to run the import cell below.

In [None]:
import requests
from IPython.display import Image, display

(a) What is the base url for API requests for TheCatAPI? Assign it to `cat_base_url`

In [None]:
cat_base_url = 'https://api.thecatapi.com/v1'

(b) Create the URL to request 10 random cat images, assign it to `cat_images_url`

In [None]:
cat_images_url = cat_base_url + '/images/search?limit=10'

(c) Send an API request using your URL and assign the response to `cat_response`. Assign the payload to `cat_payload`.

In [None]:
cat_response = requests.get(cat_images_url)
cat_payload = cat_response.json()

(d) Inspect your `cat_payload`. What type of data structure is it? Choose an answer from below and assign it to `data_type`:

1. list of strings
2. dictionary of lists
3. list of lists
4. list of dictionaries

In [None]:
data_type = 4

(e) Using an appropriate method, assign all the urls from your API request to `cat_images`

In [None]:
cat_images = [x['url'] for x in cat_payload]

(f) View your cats! Create a function called `cat_gallery` that will use the urls stored in `cat_images` to display your cat images. 
Use `display(Image(url=url_string))` within your function to produce images in your output. 
*Hint:  You should return nothing at the end of your function*

In [None]:
def cat_gallery(a_list):
    for url_string in a_list:
        display(Image(url=url_string))
    return
cat_gallery(cat_images)

**Question 8**

Go to this [Star Wars API website](https://swapi.dev) and find the base URL to request Star Wars data.

In [None]:
base_url = 'https://swapi.dev/api'

You want to find out how to access information on starships, what endpoint URL should you use?

In [None]:
endpoint = '/starships/'

Use the base URL and endpoint to get the json payload of starships

In [None]:
url = base_url+endpoint

In [None]:
response = requests.get(url)
response

In [None]:
starships_data = response.json()
starships_data
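
A single request like the one above returns only one page of starships. As far as the SWAPI documentation describes, list payloads carry `count`, `next`, `previous`, and `results` keys, so the remaining pages can be gathered by following `next` until it is empty. The sketch below shows that loop against made-up stand-in pages so it runs without a network connection; `collect_all_results`, `fetch_json`, and `fake_pages` are hypothetical names introduced for illustration.

```python
def collect_all_results(first_url, fetch_json):
    """Follow 'next' links until they run out, gathering every 'results' item.

    fetch_json is any callable that takes a URL and returns the parsed payload,
    for example lambda u: requests.get(u).json() for a live request.
    """
    items = []
    url = first_url
    while url:
        page = fetch_json(url)
        items.extend(page["results"])
        url = page.get("next")  # None on the last page ends the loop
    return items

# Made-up two-page payload standing in for real responses
fake_pages = {
    "page1": {"count": 3, "next": "page2", "previous": None,
              "results": [{"name": "A"}, {"name": "B"}]},
    "page2": {"count": 3, "next": None, "previous": "page1",
              "results": [{"name": "C"}]},
}
all_ships = collect_all_results("page1", fake_pages.__getitem__)
print([ship["name"] for ship in all_ships])
```

Passing the fetch step in as a function keeps the paging logic testable offline; for live data, `url` and a `requests`-based fetcher would take the place of the stand-ins.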

**Question 9**

Given a DataFrame containing first and last names, add a column containing abbreviations made up of the first and last initials (capitalized), both followed by periods (ex. John Owens yields J.O., Amy Watson yields A.W.).

In [None]:
import pandas as pd
data = {'first_name': ['John', 'Amy', 'Evan', 'Jane'],
     'last_name': ['Owens','Watson','Guerrero','Jones']}
names_df = pd.DataFrame(data)

In [None]:
# take the first letter of each name, force it to upper case, and add the periods
names_df['abbreviation'] = names_df.apply(lambda x: x['first_name'][0].upper()+'.'+x['last_name'][0].upper()+'.', axis=1)
names_df

**Question 10**

Use the data below called names to create a new column with only the last names of each person.

In [None]:
data = {
    'Name': ['Jack Arnold', 'Steven Hughes', 'Alex Soup', 'John Brown', 'Alan Shwartz'],
    'Occupation': ['Mechanic', 'Dentist', 'Trainer', 'Ranger', 'Police Officer'],
    'Age': [34, 39, 28, 37, 41],
    'Favorite Color': ['White', 'Gray', 'Orange', 'Brown', 'Red'],
    'Happiness Score': [7.5, 8.0, 7.2, 2.3, 4.8]
}
names = pd.DataFrame(data)
names

In [None]:
# take the final word of each name so any middle names are skipped
names['Last Name'] = names['Name'].str.split().str[-1]
names

**Question 11** 

Given the dataframe below, use seaborn to create a bar chart whose bar heights are the count values, with hue set by color and faceted by object.  

HINT: Use a catplot perhaps or a facet grid which we can map a barchart onto

In [None]:
df = pd.DataFrame({'Object': ['shoes','shoes', 'umbrella', 'umbrella', 'umbrella' ,'shirt', 'shirt' ], 'Color': ['yellow', 'blue', 'red', 'blue', 'purple', 'yellow', 'green'], 'Count': [12, 20, 15, 22, 13, 17, 21]})
df

In [None]:
sns.catplot(data=df, kind='bar', x='Color', y='Count', hue='Color', col='Object');

**Question 12**

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/ronniebugia/steph-curry-data/refs/heads/master/stephen_curry.csv')
df = df.loc[:9]
df.head()

Above are the career per-game statistics for Golden State Warriors point guard Stephen Curry. Using the statistics from the dataframe, can you find the season in which Curry made his 3000th three-pointer?

In [None]:
# approximate each season's total: threes per game times games played
df['3P_total'] = df['3P'] * df['G']

In [None]:
df['Cumulative_3s'] = df['3P_total'].cumsum()

In [None]:
three_k = df[df['Cumulative_3s'] >= 3000].index.min()

In [None]:
df.loc[three_k]
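
The steps above (estimate season totals, accumulate them, then find the first row past a threshold) can be sketched end to end on a small made-up table. The numbers below are invented for illustration and are not real player statistics, and `target` and `first_season` are hypothetical names. Because the source columns are rounded per-game averages, the season totals are estimates, so the sketch rounds them to whole shots.

```python
import pandas as pd

# Made-up per-game averages, not real player statistics
toy = pd.DataFrame({
    "Season": ["2001-02", "2002-03", "2003-04", "2004-05"],
    "G": [80, 70, 82, 60],
    "3P": [1.5, 2.0, 2.5, 3.0],
})

# Per-game average times games played approximates the season total
toy["3P_total"] = (toy["3P"] * toy["G"]).round()
toy["Cumulative_3s"] = toy["3P_total"].cumsum()

target = 400
reached = toy["Cumulative_3s"] >= target
# idxmax on a boolean Series gives the first True label; guard the all-False case
first_season = toy.loc[reached.idxmax(), "Season"] if reached.any() else None
print(first_season)
```

The `reached.any()` guard matters when the threshold is never crossed within the rows kept, in which case no season should be reported.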