## 8/10 Question

You are given the tables below, showing Store, Product, and Sales information for a chain of grocery stores. The column names are descriptive enough that you should be able to interpret what each field shows.
    
Store


|store_id	|location|
|---|---|
|91110|	New York|
|99525|	Los Angeles|
|37340|	Tokyo|
|32016|	Detroit|
|57507|	London|

Product

|product_id|	product_name	|price_usd|
|---|---|---|
|31331|	Apples|	2|
|34611|	Lettuce	|3|
|49760|	Chicken|	5|
|26583|	Lemons|	1|
|20267|	Bread	|2|
    
Sales

|sale_id|	product_id|	store_id	|date|
|---|---|---|---|
|1|	31331|	91110|	02/20/2020|
|1|	31331|	91110|	02/20/2020|
|2|	34611|	57507|	02/20/2020|
|3|	26583|	37340|	02/20/2020|
|3|	34611|	32016|	02/20/2020|
|3|	20267|	99525|	02/21/2020|
|4|	31331|	99525|	02/21/2020|
|5|	49760|	99525|	02/21/2020|
|6|	34611|	57507|	02/21/2020|
|7|	31331|	91110|	02/21/2020|

    

    
Using the tables above, write a SQL query to return the number of sales as well as the average sale price (in dollars) for a given location.


    
Your output should return the following columns:


    
|location|	number_sales	|avg_sale_price|
|---|---|---|
|X	|Y	|Z|
|A	|B	|C|

### Approach

To solve this problem, I will write a query in two parts:
- a subquery that joins the three tables, groups the rows by sale id and store, and sums the product prices in each group. This yields one row per sale at each store, holding that sale's total. Grouping by store as well as sale id matters because sale_id 3 appears at three different stores.
- an outer query that groups those rows by location, counting the number of sales and averaging the sale totals.

### Solution

select location, count(sale_id) as number_sales, avg(sale_price) as avg_sale_price<br>
from (<br>
select store.location, sales.sale_id, sum(product.price_usd) as sale_price<br>
from store<br>
join sales<br>
on store.store_id = sales.store_id<br>
join product<br>
on sales.product_id = product.product_id<br>
group by store.location, sales.store_id, sales.sale_id<br>
) as sale_totals<br>
group by location<br>
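
As a quick check, the query can be run against the sample rows. The sketch below is one way to do that, assuming SQLite via Python's built-in `sqlite3` module; the table definitions, column types, the `sale_totals` alias, and the `order by` clause (added only to make the output order stable) are my own assumptions and not part of the question.

```python
import sqlite3

# In-memory database holding the Store, Product, and Sales rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table store (store_id integer, location text);
    create table product (product_id integer, product_name text, price_usd real);
    create table sales (sale_id integer, product_id integer, store_id integer, date text);
""")
conn.executemany("insert into store values (?, ?)", [
    (91110, "New York"), (99525, "Los Angeles"), (37340, "Tokyo"),
    (32016, "Detroit"), (57507, "London"),
])
conn.executemany("insert into product values (?, ?, ?)", [
    (31331, "Apples", 2), (34611, "Lettuce", 3), (49760, "Chicken", 5),
    (26583, "Lemons", 1), (20267, "Bread", 2),
])
conn.executemany("insert into sales values (?, ?, ?, ?)", [
    (1, 31331, 91110, "02/20/2020"), (1, 31331, 91110, "02/20/2020"),
    (2, 34611, 57507, "02/20/2020"), (3, 26583, 37340, "02/20/2020"),
    (3, 34611, 32016, "02/20/2020"), (3, 20267, 99525, "02/21/2020"),
    (4, 31331, 99525, "02/21/2020"), (5, 49760, 99525, "02/21/2020"),
    (6, 34611, 57507, "02/21/2020"), (7, 31331, 91110, "02/21/2020"),
])

# Inner query: one row per (store, sale) with the sale total.
# Outer query: number of sales and average sale total per location.
query = """
    select location, count(sale_id) as number_sales, avg(sale_price) as avg_sale_price
    from (
        select store.location, sales.sale_id, sum(product.price_usd) as sale_price
        from store
        join sales on store.store_id = sales.store_id
        join product on sales.product_id = product.product_id
        group by store.location, sales.store_id, sales.sale_id
    ) as sale_totals
    group by location
    order by location
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

On the sample data, New York's sale 1 contains two apples (total 4) and sale 7 contains one (total 2), so New York should come out as 2 sales with an average of 3.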

## 8/8 Question

Below is a snippet from a table that contains information about employees that work at Company XYZ:
    

    

|employee_name|	employee_id|	date_joined|	age|	yrs_of_experience|
|---|---|---|---|---|
|Andy|	123456|	2015-02-15|	45|	24|
|Beth|	789456|	NaN|	36|	15|
|Cindy|	654123|	2017-05-16|	34|	14|
|Dale|	963852|	2018-01-15|	25|	4|

    
    

    
Company XYZ recently migrated database systems, causing some of the date_joined records to be NULL. An analyst in human resources tells you that a NULL date_joined indicates the employee joined prior to 2010. You also find out that there are multiple employees with the same name and that some employees have duplicate records.

   
    
    
Given this, write code to find the number of employees that joined each month. You can group all of the null values as Dec 1, 2009.

### Explanation

To solve this problem, I will:
- import the data into a dataframe to illustrate the code on the provided data
- remove rows with duplicate employee_id values (assuming that duplicate records, as described above, share the same employee_id; names cannot be used for this because different employees can share a name)
- replace nulls in the date field with Dec 1, 2009
- create new columns for the month and year in which each employee joined
- group the data by month and year joined, and count the number of employees in each combination

### Solution

In [2]:
import pandas as pd
import numpy as np

In [31]:
# Sample rows from the question; the empty string stands in for the NULL date_joined
data = {'employee_name': ["Andy","Beth","Cindy","Dale"], 'employee_id': [123456,789456,654123,963852],"date_joined":["02-05-2015","","05-16-2017","01-15-2018"],"age":[45,36,34,25], "yrs_of_experience":[24,15,14,4]}
august_eight_df = pd.DataFrame.from_dict(data)
# drop_duplicates returns a new dataframe, so assign the result back
august_eight_df = august_eight_df.drop_duplicates(subset=['employee_id'])
# the empty string parses to NaT; fill it with Dec 1, 2009
august_eight_df['date_joined'] = pd.to_datetime(august_eight_df['date_joined']).fillna(pd.Timestamp("2009-12-01"))
august_eight_df['month_joined'] = august_eight_df['date_joined'].dt.month
august_eight_df['year_joined'] = august_eight_df['date_joined'].dt.year
august_eight_df

| |employee_name|employee_id|date_joined|age|yrs_of_experience|month_joined|year_joined|
|---|---|---|---|---|---|---|---|
|0|Andy|123456|2015-02-05|45|24|2|2015|
|1|Beth|789456|2009-12-01|36|15|12|2009|
|2|Cindy|654123|2017-05-16|34|14|5|2017|
|3|Dale|963852|2018-01-15|25|4|1|2018|


In [34]:
august_eight_df.groupby(['month_joined', 'year_joined']).size().reset_index(name='Number of Employees Joined')

| |month_joined|year_joined|Number of Employees Joined|
|---|---|---|---|
|0|1|2018|1|
|1|2|2015|1|
|2|5|2017|1|
|3|12|2009|1|
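
The sample rows contain no duplicates or shared names, so they do not exercise the de-duplication step. The sketch below walks through the same logic in plain Python on a slightly extended data set. The extra rows are hypothetical: a repeated record for Cindy and a second employee named Andy with a made-up id (111111) and join date. The names `records`, `unique_by_id`, and `joined_per_month` are likewise my own.

```python
from collections import Counter
from datetime import date

# Rows from the question plus two hypothetical additions:
# a duplicate record for Cindy and a different employee also named Andy.
records = [
    {"employee_name": "Andy", "employee_id": 123456, "date_joined": date(2015, 2, 15)},
    {"employee_name": "Beth", "employee_id": 789456, "date_joined": None},
    {"employee_name": "Cindy", "employee_id": 654123, "date_joined": date(2017, 5, 16)},
    {"employee_name": "Cindy", "employee_id": 654123, "date_joined": date(2017, 5, 16)},
    {"employee_name": "Dale", "employee_id": 963852, "date_joined": date(2018, 1, 15)},
    {"employee_name": "Andy", "employee_id": 111111, "date_joined": date(2015, 2, 20)},
]

# NULL join dates are grouped as Dec 1, 2009, as the question allows.
DEFAULT_JOIN_DATE = date(2009, 12, 1)

# De-duplicate on employee_id (not name), keeping the first record seen.
unique_by_id = {}
for record in records:
    unique_by_id.setdefault(record["employee_id"], record)

# Count employees per (year, month) of joining.
joined_per_month = Counter()
for record in unique_by_id.values():
    joined = record["date_joined"] or DEFAULT_JOIN_DATE
    joined_per_month[(joined.year, joined.month)] += 1

for (year, month), count in sorted(joined_per_month.items()):
    print(year, month, count)
```

Here the two Andys are counted separately for February 2015 because their ids differ, while Cindy's repeated record is counted once.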


## 8/6 Question

Suppose you're analyzing a population of 100,000 people, and you're trying to understand life expectancy. Within this population of 100,000 people, 65% can expect to live to the age of 70, while 25% can expect to live to age 80. Given that a person is already 70, what is the probability that they live to the age 80? 

### Approach

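To solve this problem, I will use conditional probability with the given information. The basic formula (Bayes' rule) is: P(A|B) = P(B|A) \* P(A) / P(B)

Here A is "lives to 80" and B is "lives to 70". Anyone who lives to 80 has necessarily lived to 70, so P(live to 70 | live to 80) = 1.

P(live to 80 | already 70) = P(live to 70 | live to 80) \* P(live to 80) / P(live to 70)

P(live to 80 | already 70) = 1\*0.25/0.65 = 0.3846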
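
The same result can be checked by counting people directly. This is a minimal sketch using the population size and percentages from the question; the variable names are my own.

```python
from fractions import Fraction

population = 100_000
reach_70 = population * 65 // 100  # people expected to live to 70
reach_80 = population * 25 // 100  # people expected to live to 80 (all of whom also reached 70)

# Among those who reach 70, the share who go on to reach 80.
p_80_given_70 = Fraction(reach_80, reach_70)
print(p_80_given_70, round(float(p_80_given_70), 4))
```

The fraction reduces to 5/13, which rounds to 0.3846.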
