## 7/13 Question

Suppose you are given a SQL table called shoe_info that contains shoe brands (field name: brand) and shoe prices (field name: price). You are tasked with updating the prices in this table for a few brands of shoes. Specifically, you have been asked to update all Nike shoe prices to 100, and all Adidas shoe prices to 85. Using SQL, write a query to perform this action.

### Approach

To solve this problem, I will use a case statement to update the table. In the case where a row has the Nike or Adidas brand, I will update the price value to the specified dollar amount. For all other rows I will keep the price the same and end the case statement.

### Solution

UPDATE shoe_info <br>
SET price = CASE <br>
     WHEN brand = 'Nike' THEN 100 <br>
     WHEN brand = 'Adidas' THEN 85 <br>
     ELSE price END <br>
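
One way to try the CASE update locally is to run it against an in-memory SQLite database from Python. The sketch below is only an illustration: the helper name `update_prices` and the sample rows are hypothetical, and SQLite stands in for whatever database actually hosts shoe_info.

```python
import sqlite3


def update_prices(rows):
    """Load (brand, price) rows into an in-memory table, run the CASE update,
    and return the table contents afterwards."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shoe_info (brand TEXT, price INTEGER)")
    conn.executemany("INSERT INTO shoe_info VALUES (?, ?)", rows)
    conn.execute(
        """
        UPDATE shoe_info
        SET price = CASE
            WHEN brand = 'Nike' THEN 100
            WHEN brand = 'Adidas' THEN 85
            ELSE price  -- every other brand keeps its current price
        END
        """
    )
    result = conn.execute(
        "SELECT brand, price FROM shoe_info ORDER BY rowid"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample rows, not part of the original question
sample_rows = [("Nike", 120), ("Adidas", 90), ("Puma", 70)]
print(update_prices(sample_rows))
```

The `ELSE price` branch matters here: without it, brands that match neither condition would have their price set to NULL.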

## 7/11 Question

Given the following dataframe, write code using Python (Pandas) to return the rows that contain the string 'J' in the name column (to practice searching string contains).

Next, write code to return all rows where favorite_color does not contain the string 'r'.


|Age|	Favorite Color	|Grade|	Name|
|---|---|---|---|  
|20	|blue	|88	|Willard Morris|
|19	|blue	|95	|Al Jennings|
|22	|yellow	|92	|Omar Mullins|
|21	|green	|70	|Spencer McDaniel|

### Approach

To solve this problem, I will use the pandas str.contains method to filter the dataframe to rows where the name contains 'J'. For the second task, I will apply the same method to the favorite color column and invert the result to find all rows where the favorite color doesn't contain an 'r'.

### Solution

In [5]:
import pandas as pd
import numpy as np

d = {'Age': [20,19,22,21], 'Favorite Color': ['blue','blue','yellow','green'], 'Grade':[88,95,92,70], 'Name':['Willard Morris','Al Jennings','Omar Mullins','Spencer McDaniel']}
df = pd.DataFrame(data=d)

df_with_J = df[df.Name.str.contains('J')]

In [6]:
df_with_J

Unnamed: 0,Age,Favorite Color,Grade,Name
1,19,blue,95,Al Jennings


In [7]:
df_no_r = df[~df['Favorite Color'].str.contains('r')]

In [8]:
df_no_r

Unnamed: 0,Age,Favorite Color,Grade,Name
0,20,blue,88,Willard Morris
1,19,blue,95,Al Jennings
2,22,yellow,92,Omar Mullins


## 7/8 Question

Give an example of when you would want to use a One Way ANOVA test. Walk through the example, your reasoning for choosing a One Way ANOVA, and the steps you would take to run the test.

## 7/6 Question

Suppose you're given an array of integers of length N. Write a function using Python to select j numbers that maximize the absolute difference between the sum of the j numbers chosen and the sum of those remaining in the array.

 
    
Examples:

Input: arr[ ] = [2, 6, 2, 1, 10]

j = 2

Output: Here 1 and 2 are selected. The sum of the j numbers is 3 and the sum of the remaining numbers is 6+10+2 = 18, so the difference is 15.

Input: arr[ ] = [1, 4, 3, 2, 4]

j = 4

Output: Here we would select 4, 4, 3, and 2 for a sum of 13, whereas the sum of the remaining number is 1. Therefore, the difference between the j numbers and the remaining number is 12.

In [21]:
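def find_mad(arr, j):
    # Sort a copy so the caller's list is not mutated
    ordered = sorted(arr)
    total = sum(ordered)
    # The difference is |total - 2 * chosen_sum|, so it peaks when the chosen
    # sum is as small or as large as possible: the j smallest or j largest values
    smallest_j = sum(ordered[:j])
    largest_j = sum(ordered[len(ordered) - j:])
    return max(abs(total - 2 * smallest_j), abs(total - 2 * largest_j))

In [26]:
find_mad([1, 4, 3, 2, 4],2)

8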
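
A quick way to sanity-check any solution to this question is an exhaustive search over every possible selection of j numbers. The sketch below is a hypothetical helper (`brute_force_mad` is not part of the original notebook) and is only practical for small arrays, since the number of combinations grows quickly.

```python
from itertools import combinations


def brute_force_mad(arr, j):
    """Try every way of choosing j elements and return the largest absolute
    difference between the chosen sum and the sum of the remaining elements."""
    total = sum(arr)
    # chosen_sum - remaining_sum == 2 * chosen_sum - total
    return max(abs(2 * sum(chosen) - total) for chosen in combinations(arr, j))


# The two examples from the question statement
print(brute_force_mad([2, 6, 2, 1, 10], 2))  # 15
print(brute_force_mad([1, 4, 3, 2, 4], 4))   # 12
```

Comparing a faster implementation against this brute force on a few small inputs is a cheap way to catch cases where only one of the two extremes (smallest j or largest j) was considered.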

## 07/04 Question

Suppose you work for Airbnb as an analyst. A team has come to you asking which cities generate the highest revenue for the company in 2017. Using the schemas below, write a SQL query to answer this question.

You have a table with property location information and another table with stay information. The schemas of the tables are below:


    
Table: property_location_info


    
|Column Name|	Data Type|	Description|
|---|---|---|
|property_id|	integer|	ID of the property location|
|country|	string|	country code of the property location|
|city_name|	string|	name of city (note there can be multiple cities with the same name)|
|subregion_name|	string|	province, state, or subregion name|
|address|	string|	address of property location|

    

    
Table: stays_info


    
|Column Name|	Data Type|	Description|
|---|---|---|
|guest_id|	integer|	ID of guest|
|property_id|	integer|	ID of the property location|
|host_id|	integer|	ID of the host managing the property|
|revenue|	integer|	cost of stay for guest in USD|
|date_start|	string|	start day of stay, format is "YYYY-mm-dd"|
|date_end|	string|	end day of stay, format is "YYYY-mm-dd"|
|stay_length|	integer|	number of days for the stay|
|airbnb_revenue|	integer|	revenue that Airbnb collected on stay|

### Approach

To solve this problem I will create a query that:
- joins the two tables on property_id
- keeps only stays that started in 2017, by converting the date_start string to a date value and extracting the year from it
- groups the remaining rows by city
- selects the city name and the sum of airbnb_revenue for each city
- orders the results by that sum in descending order, so the top-earning cities show first

### Solution

SELECT country, subregion_name, city_name, SUM(airbnb_revenue) AS airbnb_revenue_by_city <br>
FROM property_location_info <br>
JOIN stays_info ON property_location_info.property_id = stays_info.property_id <br>
WHERE EXTRACT(YEAR FROM CAST(date_start AS DATE)) = 2017 <br>
GROUP BY country, subregion_name, city_name <br>
ORDER BY SUM(airbnb_revenue) DESC <br>
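
The query can be exercised on a small in-memory SQLite database from Python. This is a sketch under several assumptions: the helper name `revenue_by_city` and all sample rows are hypothetical, the year filter is applied before grouping with WHERE, SQLite's `strftime('%Y', ...)` is swapped in because SQLite does not provide `EXTRACT`, and rows are grouped by country and subregion as well as city because the schema notes that city names can repeat.

```python
import sqlite3


def revenue_by_city(properties, stays, year="2017"):
    """Return (country, subregion, city, total Airbnb revenue) for stays that
    started in the given year, highest revenue first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE property_location_info (property_id INTEGER, country TEXT, "
        "city_name TEXT, subregion_name TEXT, address TEXT)"
    )
    conn.execute(
        "CREATE TABLE stays_info (guest_id INTEGER, property_id INTEGER, "
        "host_id INTEGER, revenue INTEGER, date_start TEXT, date_end TEXT, "
        "stay_length INTEGER, airbnb_revenue INTEGER)"
    )
    conn.executemany("INSERT INTO property_location_info VALUES (?, ?, ?, ?, ?)", properties)
    conn.executemany("INSERT INTO stays_info VALUES (?, ?, ?, ?, ?, ?, ?, ?)", stays)
    rows = conn.execute(
        """
        SELECT p.country, p.subregion_name, p.city_name,
               SUM(s.airbnb_revenue) AS airbnb_revenue_by_city
        FROM property_location_info AS p
        JOIN stays_info AS s ON p.property_id = s.property_id
        WHERE strftime('%Y', s.date_start) = ?  -- SQLite stand-in for EXTRACT(YEAR ...)
        GROUP BY p.country, p.subregion_name, p.city_name
        ORDER BY airbnb_revenue_by_city DESC
        """,
        (year,),
    ).fetchall()
    conn.close()
    return rows


# Hypothetical sample data: two different cities named Springfield
properties = [
    (1, "US", "Springfield", "IL", "1 Main St"),
    (2, "US", "Springfield", "MA", "2 Elm St"),
    (3, "FR", "Paris", "Ile-de-France", "3 Rue A"),
]
stays = [
    (10, 1, 100, 500, "2017-03-01", "2017-03-04", 3, 50),
    (11, 1, 100, 300, "2016-12-30", "2017-01-02", 3, 30),  # starts in 2016, excluded
    (12, 2, 101, 800, "2017-06-10", "2017-06-12", 2, 80),
    (13, 3, 102, 1000, "2017-07-01", "2017-07-05", 4, 100),
    (14, 3, 102, 400, "2018-01-01", "2018-01-03", 2, 40),  # starts in 2018, excluded
    (15, 1, 100, 200, "2017-09-01", "2017-09-02", 1, 20),
]
print(revenue_by_city(properties, stays))
```

With this sample data the two Springfields stay separate in the result, which would not be the case if the rows were grouped by city_name alone.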

## 07/01 Question

Suppose you have the following dataset*, which is a list of leaders for all independent states in the world as outlined in Gleditsch and Ward.


    
    
With this data, for all leaders that have valid birth and death years, can you plot the life expectancy over time for these leaders?


    
    
Note: Here I would probably recommend a box and whisker plot to show distributions over time in the same chart.

### Approach

For this problem I will:
- import the relevant libraries
- import the dataset
- filter the data so only leaders with birth and death dates remain
- create a new field to calculate the age at which each leader died
- create a field rounding the birth year of a leader to the nearest decade
- use plotly to plot age at death grouped by year born
- use plotly to plot age at death grouped by decade-group born, because the above plot shows a trend but individual results are not easy to look at in detail

### Solution

In [25]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px

In [17]:
data_07_01 = pd.read_csv("./Data/2022_07_01_data.csv")

In [18]:
data_07_01 = data_07_01[(data_07_01.yrborn>0) & (data_07_01.yrdied>0)].copy()  # copy so new columns can be added without a SettingWithCopyWarning

In [22]:
data_07_01['age_at_death'] = data_07_01.yrdied - data_07_01.yrborn
data_07_01['age_decade_rounded'] = data_07_01.yrborn.round(-1)

In [39]:
fig = px.box(data_07_01,x='yrborn',y='age_at_death',labels=dict(yrborn="Year Born", age_at_death="Age At Death"))
fig.update_layout(title_text='Leader Life Expectancy by Year Born', title_x=0.5)
fig.show()

In [41]:
fig = px.box(data_07_01,x='age_decade_rounded',y='age_at_death',labels=dict(age_decade_rounded="Decade Born", age_at_death="Age At Death"))
fig.update_layout(title_text='Leader Life Expectancy by Decade Born', title_x=0.5)
fig.show()
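
The filtering and decade-grouping steps can also be sketched without pandas, which makes the logic easy to check by hand. Everything here is an assumption for illustration: the (yrborn, yrdied) pairs are made up rather than taken from the dataset, -999 stands in for a missing year, and leaders are bucketed by flooring the birth year to its calendar decade instead of rounding to the nearest decade.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (yrborn, yrdied) pairs, NOT real rows from the dataset;
# -999 is an assumed placeholder for a missing year.
leaders = [(1801, 1870), (1809, 1865), (1812, 1890), (1874, 1965), (1879, -999)]


def median_age_by_decade(rows):
    """Group valid rows by birth decade and return the median age at death."""
    ages = defaultdict(list)
    for born, died in rows:
        if born > 0 and died > 0:  # keep only rows with both years present
            decade = (born // 10) * 10  # floor to the calendar decade
            ages[decade].append(died - born)
    return {decade: median(values) for decade, values in sorted(ages.items())}


print(median_age_by_decade(leaders))
```

Flooring puts 1801 and 1809 in the same 1800 bucket, whereas rounding to the nearest decade would split them between 1800 and 1810. Either choice works for the plot as long as the axis label says which one was used.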
