<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%204/logo.png" width="300" alt="cognitiveclass.ai logo">
</center>

# **Dash Basics: HTML and Core Components**

### **Objectives**

After completing the lab you will be able to:

* Create a Dash application layout
* Add HTML H1, P, and Div components
* Add a core graph component
* Add multiple charts

### **Data Used**

[Airline Reporting Carrier On-Time Performance](https://developer.ibm.com/exchanges/data/all/airline?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ) dataset from <a href="https://developer.ibm.com/exchanges/data?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork-20297740&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ">Data Asset eXchange</a>



## **Let's start creating a *Dash* application**

#### **Goal**

Create a dashboard that displays the percentage of flights in each distance group. A distance group is a 250-mile distance interval for a flight segment. For example, a flight covering more than 250 and up to 500 miles falls into distance group 2 (250 miles + 250 miles).
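
As a rough illustration of the grouping rule, the sketch below maps a flight distance to its distance group. The helper name `distance_group` is hypothetical, and assigning exact multiples of 250 to the lower group is an assumption; in the lab, the dataset's own `DistanceGroup` column is what the chart uses.

```python
import math


def distance_group(miles: float) -> int:
    """Return the 250-mile distance group for a flight distance (hypothetical helper)."""
    # Assumption: exact multiples of 250 belong to the lower group, so 500 miles -> group 2
    return max(1, math.ceil(miles / 250))


for miles in (100, 400, 500, 501):
    print(miles, "miles -> group", distance_group(miles))
```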

#### **Expected Output**
Below is the expected result from the lab. Our dashboard application consists of three components:

* Title of the application
* Description of the application
* Chart conveying the proportion of distance group by month

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%204/images/basics_output.png" alt="Expected dashboard output">

#### **To do:**

1. Import required libraries and read the dataset
2. Create an application layout
3. Add a title to the dashboard using the HTML H1 component
4. Add a paragraph about the chart using the HTML P component
5. Add the pie chart above using the core graph component
6. Run the app

# **TASK 1: Data Preparation**

Let's start with

* Importing necessary libraries
* Reading and sampling 500 random data points
* Getting the chart ready

In [1]:
# Import required libraries

import pandas as pd
import plotly.express as px
import dash
from dash import dcc
from dash import html



In [2]:
# Read the airline data into pandas dataframe
airline_data =  pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/airline_data.csv', 
                            encoding = "ISO-8859-1",
                            dtype={'Div1Airport': str, 'Div1TailNum': str, 
                                   'Div2Airport': str, 'Div2TailNum': str})

# Randomly sample 500 data points. Setting the random state to 42 gives the same sample on every run.
data = airline_data.sample(n=500, random_state=42)

# Pie Chart Creation
fig = px.pie(data, values='Flights', names='DistanceGroup', title='Distance group proportion by flights')
fig.show()
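
To see what fixing `random_state` buys, the minimal sketch below draws two samples with the same seed and checks that they match. The toy DataFrame is a hypothetical stand-in for the airline data.

```python
import pandas as pd

# Hypothetical toy data standing in for the airline dataset
toy = pd.DataFrame({
    "flight": range(10),
    "DistanceGroup": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
})

# Two draws with the same seed select the same rows in the same order
first_draw = toy.sample(n=4, random_state=42)
second_draw = toy.sample(n=4, random_state=42)
print(first_draw.equals(second_draw))

# Without a seed, the selected rows generally change from run to run
unseeded_draw = toy.sample(n=4)
```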

# **TASK 2: Create dash application and get the layout skeleton**

Next, we create a skeleton for our dash application. Our dashboard application has three components as seen before:

* Title of the application
* Description of the application
* Chart conveying the proportion of distance group by month

Mapping to the respective Dash HTML tags:

* Title added using `html.H1()` tag
* Description added using `html.P()` tag
* Chart added using `dcc.Graph()` tag

Copy the code below into the `dash_basics.py` script and review the structure.

In [5]:
# Create a dash application
app = dash.Dash(__name__)

# Get the layout of the application and adjust it.
# Create an outer division using html.Div and add title to the dashboard using html.H1 component
# Add description about the graph using HTML P (paragraph) component
# Finally, add graph component.
app.layout = html.Div(children=[html.H1('Airline Dashboard',
                                        style={'textAlign': 'center', 
                                               'color': '#503D36', 
                                               'fontSize': 40}),
                                html.P('Proportion of distance group (250 mile distance interval group) by flights.', 
                                       style={'textAlign':'center',
                                               'color': '#F57241'}),
                                dcc.Graph(figure=fig),
                                               
                    ])

# Run the application                   
if __name__ == '__main__':
    app.run()


# **TASK 6: Run the application**

* Run the Python file using the following command in the terminal:

In [None]:
python3.8 dash_basics.py

* Observe the port number shown in the terminal.

<img src=https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%204/images/port.png>

Click on the `Launch Application` option from the side menu bar. Provide the port number and click `OK`.


<img src=https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%204/images/launch_application_new.PNG>

The app will open in a new browser tab like below:



<img src=https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/labs/Module%204/images/lab1_output.png>

#### **Congratulations, you have successfully created your first dash application!**

In [94]:
# Data from the table is stored in a dictionary
data = {
    'requester_id': [1, 1, 2, 3],
    'accepter_id': [2, 3, 3, 4],
    'accept_date': ['2016/06/03', '2016/06/08', '2016/06/08', '2016/06/09']
}

# Create the DataFrame from the dictionary
request_accepted_df = pd.DataFrame(data)

# Optional but recommended: convert the date column to a proper datetime format
request_accepted_df['accept_date'] = pd.to_datetime(request_accepted_df['accept_date'])

# change df Name
df_friend = request_accepted_df


In [95]:
df_friend

Unnamed: 0,requester_id,accepter_id,accept_date
0,1,2,2016-06-03
1,1,3,2016-06-08
2,2,3,2016-06-08
3,3,4,2016-06-09


In [97]:
user_id = pd.concat([df_friend['requester_id'], df_friend['accepter_id']])

In [None]:
friend_count = user_id.value_counts()


In [108]:
top_id = friend_count.index[0]
friend_counts = friend_count.iloc[0]

In [111]:
result_df = pd.DataFrame({'id': [top_id],
                          'num': [friend_counts]})

result_df

Unnamed: 0,id,num
0,3,3


In [112]:
sales_person_data = {'sales_id': [1, 2, 3, 4, 5], 
                     'name': ['John', 'Amy', 'Mark', 'Pam', 'Alex'], 
                     'salary': [100000, 12000, 65000, 25000, 5000],
                     'hire_date': ['1/1/2006', '2/1/2010', '12/1/2008', '1/1/2005', '1/1/2007']}
company_data = {'com_id': [1, 2, 3, 4], 
                'name': ['RED', 'ORANGE', 'YELLOW', 'GREEN'], 
                'city': ['Boston', 'New York', 'Boston', 'Austin']}
orders_data = {'order_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 
               'order_date': ['1/1/2014', '2/1/2014', '3/1/2014', '4/1/2014', '5/1/2014', '6/1/2014', '7/1/2014', '8/1/2014', '9/1/2014', '10/1/2014'], 
               'com_id': [3, 4, 1, 1, 2, 3, 2, 4, 1, 4], 
               'sales_id': [4, 5, 1, 1, 2, 4, 3, 5, 3, 4], 
               'amount': [10000, 5000, 50000, 12000, 6000, 1000, 7000, 1500, 2000, 3000]}

sales_person_df = pd.DataFrame(sales_person_data)
company_df = pd.DataFrame(company_data)
orders_df = pd.DataFrame(orders_data)

In [125]:
red_company_id = company_df[company_df['name'] == 'RED']['com_id']

In [126]:
red_orders = orders_df[orders_df['com_id'].isin(red_company_id)]['sales_id'].unique()

In [127]:
nored_salesperson = sales_person_df[~sales_person_df['sales_id'].isin(red_orders)]

In [128]:
nored_salesperson[['name']]

Unnamed: 0,name
1,Amy
3,Pam
4,Alex


In [130]:
x = [13, 10]
y = [15, 20]
z = [30, 15]

triangle = pd.DataFrame({'x': x,
                         'y': y,
                         'z': z})

In [131]:
triangle

Unnamed: 0,x,y,z
0,13,15,30
1,10,20,15


In [132]:
triangle['triangle'] = 'No'

1    No
Name: triangle, dtype: object

In [137]:
condition1 = triangle['x'] + triangle['y'] > triangle['z']
condition2 = triangle['x'] + triangle['z'] > triangle['y']
condition3 = triangle['y'] + triangle['z'] > triangle['x']

In [141]:
triangle.loc[condition1 & condition2 & condition3, 'triangle'] = 'Yes'

In [142]:
triangle

Unnamed: 0,x,y,z,triangle
0,13,15,30,No
1,10,20,15,Yes


In [145]:
num = [8, 8, 3, 3, 3 , 1, 4, 5, 6]

my_numbers = pd.DataFrame({'num': num})

my_numbers

Unnamed: 0,num
0,8
1,8
2,3
3,3
4,3
5,1
6,4
7,5
8,6


In [180]:
unique_num = my_numbers.drop_duplicates(subset='num', keep=False)
unique_num

Unnamed: 0,num
5,1
6,4
7,5
8,6


In [193]:
import pandas as pd

# Data from the table is stored in a dictionary
data = {
    'id': [1, 2, 3, 4, 5],
    'movie': ['War', 'Science', 'irish', 'Ice song', 'House card'],
    'description': ['great 3D', 'fiction', 'boring', 'Fantacy', 'Interesting'],
    'rating': [8.9, 8.5, 6.2, 8.6, 9.1]
}

# Create the DataFrame from the dictionary
cinema = pd.DataFrame(data)

# Display the DataFrame
print(cinema)

   id       movie  description  rating
0   1         War     great 3D     8.9
1   2     Science      fiction     8.5
2   3       irish       boring     6.2
3   4    Ice song      Fantacy     8.6
4   5  House card  Interesting     9.1


In [None]:
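def not_boring_movies(cinema: pd.DataFrame) -> pd.DataFrame:
    # Keep movies with an odd id whose description is not 'boring', highest rating first
    odd_id = (cinema['id'] % 2) != 0
    not_boring = cinema['description'] != 'boring'
    return cinema.loc[odd_id & not_boring].sort_values(by='rating', ascending=False)

not_boring_movies(cinema)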

Unnamed: 0,id,movie,description,rating
4,5,House card,Interesting,9.1
0,1,War,great 3D,8.9


In [None]:
import pandas as pd

# Data for the Sales table
sales_data = {
    'seller_id': [1, 1, 2, 3],
    'product_id': [1, 2, 2, 3],
    'buyer_id': [1, 2, 3, 4],
    'sale_date': ['2019-01-21', '2019-02-17', '2019-06-02', '2019-05-13'],
    'quantity': [2, 1, 1, 2],
    'price': [2000, 800, 800, 2800]
}

# Create the Sales DataFrame
sales_df = pd.DataFrame(sales_data)

# It's good practice to convert date columns to a proper datetime format
sales_df['sale_date'] = pd.to_datetime(sales_df['sale_date'])


# Display the DataFrame
print("\nSales DataFrame:")

print(sales_df)



Sales DataFrame:
   seller_id  product_id  buyer_id  sale_date  quantity  price
0          1           1         1 2019-01-21         2   2000
1          1           2         2 2019-02-17         1    800
2          2           2         3 2019-06-02         1    800
3          3           3         4 2019-05-13         2   2800


In [248]:

q1_sale = sales_df[sales_df['sale_date'] <= '2019-03-31']['product_id']
q2_sale = sales_df[sales_df['sale_date'] > '2019-03-31']['product_id']

In [266]:
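# Products sold up to 2019-03-31 that were never sold afterwards.
# `in` on a Series tests the index, so compare against a set of the values instead.
later_products = set(q2_sale)
valid_sale = [product for product in q1_sale.unique() if product not in later_products]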

In [215]:
import pandas as pd

# Data for the Product table
product_data = {
    'product_id': [1, 2, 3],
    'product_name': ['S8', 'G4', 'iPhone'],
    'unit_price': [1000, 800, 1400]
}

# Create the Product DataFrame
product_df = pd.DataFrame(product_data)

# Display the DataFrame
print("Product DataFrame:")
print(product_df)

Product DataFrame:
   product_id product_name  unit_price
0           1           S8        1000
1           2           G4         800
2           3       iPhone        1400


In [260]:
valid_q1_sale = product_df.loc[product_df['product_id'].isin(valid_sale), ['product_id','product_name']]
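
The two-step filter above can also be expressed with a single group-by on the first and last sale date of each product. This is a sketch under assumptions: the data is re-created inline so the block runs on its own, the names `span`, `only_q1_ids`, and `only_q1` are hypothetical, and the lower bound of 2019-01-01 is added here (the cells above only apply the upper bound).

```python
import pandas as pd

sales = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "sale_date": pd.to_datetime(["2019-01-21", "2019-02-17", "2019-06-02", "2019-05-13"]),
})
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["S8", "G4", "iPhone"],
})

# First and last sale date per product
span = sales.groupby("product_id")["sale_date"].agg(["min", "max"])

# A product qualifies only if every sale falls inside the first quarter of 2019
q1_start, q1_end = pd.Timestamp("2019-01-01"), pd.Timestamp("2019-03-31")
only_q1_ids = span[(span["min"] >= q1_start) & (span["max"] <= q1_end)].index

only_q1 = products[products["product_id"].isin(only_q1_ids)]
print(only_q1)
```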

In [267]:
import pandas as pd

# Data from the table stored in a Python dictionary
data = {
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    'session_id': [1, 1, 1, 4, 4, 4, 2, 2, 2, 3, 3],
    'activity_date': ['2019-07-20', '2019-07-20', '2019-07-20', '2019-07-20', 
                      '2019-07-21', '2019-07-21', '2019-07-21', '2019-07-21', 
                      '2019-07-21', '2019-06-25', '2019-06-25'],
    'activity_type': ['open_session', 'scroll_down', 'end_session', 
                      'open_session', 'send_message', 'end_session', 
                      'open_session', 'send_message', 'end_session',
                      'open_session', 'end_session']}

activity_df = pd.DataFrame(data)

activity_df

Unnamed: 0,user_id,session_id,activity_date,activity_type
0,1,1,2019-07-20,open_session
1,1,1,2019-07-20,scroll_down
2,1,1,2019-07-20,end_session
3,2,4,2019-07-20,open_session
4,2,4,2019-07-21,send_message
5,2,4,2019-07-21,end_session
6,3,2,2019-07-21,open_session
7,3,2,2019-07-21,send_message
8,3,2,2019-07-21,end_session
9,4,3,2019-06-25,open_session


In [278]:
daily_active = activity_df.groupby('activity_date')['user_id'].value_counts().reset_index()
active_users = daily_active.groupby('activity_date')['user_id'].count().reset_index()

In [282]:
result = active_users[(active_users['activity_date'] >= '2019-06-28') & (active_users['activity_date'] <= '2019-07-27')]
result.columns = ['day', 'active_users']
result

Unnamed: 0,day,active_users
1,2019-07-20,2
2,2019-07-21,2


In [292]:
activity = activity_df[activity_df['activity_date'].between('2019-06-28', '2019-07-27')]
activity = activity.groupby('activity_date').nunique().reset_index()
result = activity[['activity_date', 'user_id']].rename({'activity_date' : 'day',
                                                        'user_id' : 'active_users'}, axis=1)

In [293]:
result

Unnamed: 0,day,active_users
0,2019-07-20,2
1,2019-07-21,2


In [303]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'article_id': [1, 1, 2, 2, 4, 3, 3],
    'author_id': [3, 3, 7, 7, 7, 4, 4],
    'viewer_id': [5, 6, 7, 6, 1, 4, 4],
    'view_date': ['2019-08-01', '2019-08-02', '2019-08-01', '2019-08-02', '2019-07-22', '2019-07-21', '2019-07-21']
}

# Create the DataFrame
views = pd.DataFrame(data)

# It's good practice to convert the date column to a proper datetime format
views['view_date'] = pd.to_datetime(views['view_date'])

views

Unnamed: 0,article_id,author_id,viewer_id,view_date
0,1,3,5,2019-08-01
1,1,3,6,2019-08-02
2,2,7,7,2019-08-01
3,2,7,6,2019-08-02
4,4,7,1,2019-07-22
5,3,4,4,2019-07-21
6,3,4,4,2019-07-21


In [305]:
author_filtered = views[views['viewer_id'] == views['author_id']][['author_id']].drop_duplicates(subset='author_id')
result = author_filtered.sort_values(by='author_id')

In [306]:
result

Unnamed: 0,author_id
5,4
2,7


In [337]:
import pandas as pd

# Data for the Users table
users_data = {
    'user_id': [1, 2, 3, 4],
    'join_date': ['2018-01-01', '2018-02-09', '2018-01-19', '2018-05-21'],
    'favorite_brand': ['Lenovo', 'Samsung', 'LG', 'HP']
}

# Create the Users DataFrame
users = pd.DataFrame(users_data)

# Convert the date column to a proper datetime format
users['join_date'] = pd.to_datetime(users['join_date'])

print("Users DataFrame:")
print(users)

Users DataFrame:
   user_id  join_date favorite_brand
0        1 2018-01-01         Lenovo
1        2 2018-02-09        Samsung
2        3 2018-01-19             LG
3        4 2018-05-21             HP


In [338]:
import pandas as pd

# Data for the Orders table
orders_data = {
    'order_id': [1, 2, 3, 4, 5, 6],
    'order_date': ['2019-08-01', '2018-08-02', '2019-08-03', '2018-08-04', '2018-08-04', '2019-08-05'],
    'item_id': [4, 2, 3, 1, 1, 2],
    'buyer_id': [1, 1, 2, 4, 3, 2],
    'seller_id': [2, 3, 3, 2, 4, 4]
}

# Create the Orders DataFrame
orders= pd.DataFrame(orders_data)

# Convert the date column to a proper datetime format
orders['order_date'] = pd.to_datetime(orders['order_date'])

print("\nOrders DataFrame:")
print(orders)


Orders DataFrame:
   order_id order_date  item_id  buyer_id  seller_id
0         1 2019-08-01        4         1          2
1         2 2018-08-02        2         1          3
2         3 2019-08-03        3         2          3
3         4 2018-08-04        1         4          2
4         5 2018-08-04        1         3          4
5         6 2019-08-05        2         2          4


In [342]:
orders_2019 = orders[orders['order_date'].dt.year == 2019]
joined = pd.merge(users, orders_2019, left_on='user_id', right_on='buyer_id', how='left')

In [345]:
# Use a new name so that re-running this cell does not aggregate an already grouped frame
order_counts = joined.groupby(['user_id', 'join_date'])['order_id'].count().reset_index()

In [350]:
order_counts = order_counts.rename({'order_id' : 'orders_in_2019'}, axis=1)


In [351]:
order_counts

Unnamed: 0,user_id,join_date,orders_in_2019
0,1,2018-01-01,1
1,2,2018-02-09,2
2,3,2018-01-19,0
3,4,2018-05-21,0


In [376]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'delivery_id': [1, 2, 3, 4, 5, 6, 7],
    'customer_id': [1, 2, 1, 3, 3, 2, 4],
    'order_date': ['2019-08-01', '2019-08-02', '2019-08-11', '2019-08-24', '2019-08-21', '2019-08-11', '2019-08-09'],
    'customer_pref_delivery_date': ['2019-08-02', '2019-08-02', '2019-08-12', '2019-08-24', '2019-08-22', '2019-08-13', '2019-08-09']
}

# Create the DataFrame
delivery = pd.DataFrame(data)

# It's good practice to convert date columns to a proper datetime format
delivery['order_date'] = pd.to_datetime(delivery['order_date'])
delivery['customer_pref_delivery_date'] = pd.to_datetime(delivery['customer_pref_delivery_date'])

# Display the DataFrame
print(delivery)

   delivery_id  customer_id order_date customer_pref_delivery_date
0            1            1 2019-08-01                  2019-08-02
1            2            2 2019-08-02                  2019-08-02
2            3            1 2019-08-11                  2019-08-12
3            4            3 2019-08-24                  2019-08-24
4            5            3 2019-08-21                  2019-08-22
5            6            2 2019-08-11                  2019-08-13
6            7            4 2019-08-09                  2019-08-09


In [377]:
first_order = delivery.sort_values('order_date').drop_duplicates(subset='customer_id', keep='first').copy()

In [378]:
total_customers = first_order['customer_id'].count()
total_customers

np.int64(4)

In [379]:
immediate_orders = 0 

In [380]:
for i, row in first_order.iterrows():
    if row['order_date'] == row['customer_pref_delivery_date'] :
        immediate_orders += 1

In [381]:
immediate_orders

2

In [382]:
total = (first_order['order_date'] == first_order['customer_pref_delivery_date']).sum()
total

np.int64(2)

In [383]:
percentage = round(100 * (immediate_orders/total_customers), 2)

In [384]:
percentage

np.float64(50.0)

In [388]:
data = {'id': [1, 2, 3, 1, 1], 
        'revenue': [8000, 9000, 10000, 7000, 6000], 
        'month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar']}
department_df = pd.DataFrame(data)
department_df

Unnamed: 0,id,revenue,month
0,1,8000,Jan
1,2,9000,Jan
2,3,10000,Feb
3,1,7000,Feb
4,1,6000,Mar


In [389]:
all_months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
department_df['month'] = pd.Categorical(department_df['month'], categories=all_months, ordered=True)
department_df

Unnamed: 0,id,revenue,month
0,1,8000,Jan
1,2,9000,Jan
2,3,10000,Feb
3,1,7000,Feb
4,1,6000,Mar


In [391]:
data = {
    'id': [121, 122, 123, 124, 125, 126, 127, 128, 129, 130],
    'country': ['US', 'US', 'US', 'DE', 'US', 'US', 'US', 'US', 'DE', 'US'],
    'state': ['approved', 'declined', 'approved', 'approved', 'approved', 'declined', 'approved', 'approved', 'approved', 'declined'],
    'amount': [1000, 2000, 2000, 2000, 3000, 4000, 5000, 6000, 7000, 8000],
    'trans_date': pd.to_datetime(['2018-12-18', '2018-12-19', '2019-01-01', '2019-01-02', '2019-01-03', 
                                  '2019-01-03', '2019-01-03', '2019-01-03', '2019-01-07', '2019-02-28'])
}
transactions = pd.DataFrame(data)

In [398]:
transactions

Unnamed: 0,id,country,state,amount,trans_date,month
0,121,US,approved,1000,2018-12-18,2018-12
1,122,US,declined,2000,2018-12-19,2018-12
2,123,US,approved,2000,2019-01-01,2019-01
3,124,DE,approved,2000,2019-01-02,2019-01
4,125,US,approved,3000,2019-01-03,2019-01
5,126,US,declined,4000,2019-01-03,2019-01
6,127,US,approved,5000,2019-01-03,2019-01
7,128,US,approved,6000,2019-01-03,2019-01
8,129,DE,approved,7000,2019-01-07,2019-01
9,130,US,declined,8000,2019-02-28,2019-02


In [399]:
transactions['month'] = transactions['trans_date'].dt.strftime('%Y-%m')
result_df = transactions.groupby(['month', 'country']).agg(
        trans_count=('id', 'count'),
        approved_count=('state', lambda s: (s == 'approved').sum()),
        trans_total_amount=('amount', 'sum'),
        approved_total_amount=('amount', lambda amt: amt[transactions['state'] == 'approved'].sum())
    )


In [400]:
result_df

Unnamed: 0_level_0,Unnamed: 1_level_0,trans_count,approved_count,trans_total_amount,approved_total_amount
month,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2018-12,US,2,1,3000,1000
2019-01,DE,2,2,9000,9000
2019-01,US,5,4,20000,16000
2019-02,US,1,0,8000,0
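
The aggregation above filters approved amounts inside a lambda that reaches back into the outer `transactions` frame. One alternative, sketched here on a shortened copy of the data, is to compute helper columns first so that every aggregate is a plain built-in reduction. The column names `is_approved` and `approved_amount` and the frame name `tx` are assumptions for illustration.

```python
import pandas as pd

tx = pd.DataFrame({
    "id": [121, 122, 123, 124],
    "country": ["US", "US", "US", "DE"],
    "state": ["approved", "declined", "approved", "approved"],
    "amount": [1000, 2000, 2000, 2000],
    "trans_date": pd.to_datetime(["2018-12-18", "2018-12-19", "2019-01-01", "2019-01-02"]),
})

tx["month"] = tx["trans_date"].dt.strftime("%Y-%m")
# Helper columns: a 1/0 approval flag, and the amount counted only when approved
tx["is_approved"] = (tx["state"] == "approved").astype(int)
tx["approved_amount"] = tx["amount"] * tx["is_approved"]

summary = tx.groupby(["month", "country"], as_index=False).agg(
    trans_count=("id", "count"),
    approved_count=("is_approved", "sum"),
    trans_total_amount=("amount", "sum"),
    approved_total_amount=("approved_amount", "sum"),
)
print(summary)
```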


In [401]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'person_id': [5, 4, 3, 6, 1, 2],
    'person_name': ['Alice', 'Bob', 'Alex', 'John Cena', 'Winston', 'Marie'],
    'weight': [250, 175, 350, 400, 500, 200],
    'turn': [1, 5, 2, 3, 6, 4]
}

# Create the DataFrame
queue = pd.DataFrame(data)

# Display the DataFrame
print(queue)

   person_id person_name  weight  turn
0          5       Alice     250     1
1          4         Bob     175     5
2          3        Alex     350     2
3          6   John Cena     400     3
4          1     Winston     500     6
5          2       Marie     200     4


In [406]:
queue = queue.sort_values('turn', ascending=True)
queue['cumsum'] = queue['weight'].cumsum()

In [407]:
queue

Unnamed: 0,person_id,person_name,weight,turn,cumsum
0,5,Alice,250,1,250
2,3,Alex,350,2,600
3,6,John Cena,400,3,1000
5,2,Marie,200,4,1200
1,4,Bob,175,5,1375
4,1,Winston,500,6,1875


In [414]:
result = queue[queue['cumsum'] <= 1000].sort_values(by='cumsum', ascending=False)
result[['person_name']].head(1)

Unnamed: 0,person_name
3,John Cena


In [416]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'query_name': ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat'],
    'result': ['Golden Retriever', 'German Shepherd', 'Mule', 'Shirazi', 'Siamese', 'Sphynx'],
    'position': [1, 2, 200, 5, 3, 7],
    'rating': [5, 5, 1, 2, 3, 4]
}

# Create the DataFrame
queries = pd.DataFrame(data)

# Display the DataFrame
print(queries)

  query_name            result  position  rating
0        Dog  Golden Retriever         1       5
1        Dog   German Shepherd         2       5
2        Dog              Mule       200       1
3        Cat           Shirazi         5       2
4        Cat           Siamese         3       3
5        Cat            Sphynx         7       4


In [432]:
queries['quality'] = (queries['rating'] / queries['position']).round(2)
queries

Unnamed: 0,query_name,result,position,rating,quality
0,Dog,Golden Retriever,1,5,5.0
1,Dog,German Shepherd,2,5,2.5
2,Dog,Mule,200,1,0.0
3,Cat,Shirazi,5,2,0.4
4,Cat,Siamese,3,3,1.0
5,Cat,Sphynx,7,4,0.57


In [None]:
result = queries.groupby('query_name').agg(
    quality = ('quality', 'mean'),
    poor_queries_percentage = ('rating', lambda r: ((r < 3).mean() * 100).round(2)) 
).reset_index()
result

Unnamed: 0,query_name,quality,poor_queries_percentage
0,Cat,0.656667,33.33
1,Dog,2.5,33.33
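
In the cells above, each row's quality is rounded before averaging, and the averaged `quality` is left unrounded. A possible variant, shown as a sketch with hypothetical names (`q`, `ratio`, `is_poor`, `stats`), keeps full precision per row and rounds only the final aggregates.

```python
import pandas as pd

q = pd.DataFrame({
    "query_name": ["Dog", "Dog", "Dog", "Cat", "Cat", "Cat"],
    "position": [1, 2, 200, 5, 3, 7],
    "rating": [5, 5, 1, 2, 3, 4],
})

# Keep full precision per row; round only the final aggregates
q["ratio"] = q["rating"] / q["position"]
q["is_poor"] = q["rating"] < 3

stats = q.groupby("query_name", as_index=False).agg(
    quality=("ratio", "mean"),
    poor_query_percentage=("is_poor", "mean"),
)
stats["quality"] = stats["quality"].round(2)
stats["poor_query_percentage"] = (stats["poor_query_percentage"] * 100).round(2)
print(stats)
```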


In [438]:
import pandas as pd

# Data for the Prices table
prices_data = {
    'product_id': [1, 1, 2, 2],
    'start_date': ['2019-02-17', '2019-03-01', '2019-02-01', '2019-02-21'],
    'end_date': ['2019-02-28', '2019-03-22', '2019-02-20', '2019-03-31'],
    'price': [5, 20, 15, 30]
}

# Create the Prices DataFrame
prices_df = pd.DataFrame(prices_data)

# Convert date columns to a proper datetime format
prices_df['start_date'] = pd.to_datetime(prices_df['start_date'])
prices_df['end_date'] = pd.to_datetime(prices_df['end_date'])

print("Prices DataFrame:")
print(prices_df)

Prices DataFrame:
   product_id start_date   end_date  price
0           1 2019-02-17 2019-02-28      5
1           1 2019-03-01 2019-03-22     20
2           2 2019-02-01 2019-02-20     15
3           2 2019-02-21 2019-03-31     30


In [439]:
import pandas as pd
import numpy as np

# Data for the UnitsSold table
units_sold_data = {
    'product_id': [1, 1, 2, 2, np.nan],
    'purchase_date': ['2019-02-25', '2019-03-01', '2019-02-10', '2019-03-22', np.nan],
    'units': [100, 15, 200, 30, np.nan]
}

# Create the UnitsSold DataFrame
units_sold_df = pd.DataFrame(units_sold_data)

# Convert the date column to a proper datetime format
units_sold_df['purchase_date'] = pd.to_datetime(units_sold_df['purchase_date'])

print("\nUnitsSold DataFrame:")
print(units_sold_df)


UnitsSold DataFrame:
   product_id purchase_date  units
0         1.0    2019-02-25  100.0
1         1.0    2019-03-01   15.0
2         2.0    2019-02-10  200.0
3         2.0    2019-03-22   30.0
4         NaN           NaT    NaN


In [442]:
merged = prices_df.merge(units_sold_df, on='product_id', how='left')
date_range_condition = merged['purchase_date'].between(merged['start_date'], merged['end_date'])
merged_df = merged[date_range_condition]
merged_df

Unnamed: 0,product_id,start_date,end_date,price,purchase_date,units
0,1,2019-02-17,2019-02-28,5,2019-02-25,100.0
3,1,2019-03-01,2019-03-22,20,2019-03-01,15.0
4,2,2019-02-01,2019-02-20,15,2019-02-10,200.0
7,2,2019-02-21,2019-03-31,30,2019-03-22,30.0


In [449]:
# assign() returns a new DataFrame, so no value is set on a slice of `merged`
merged_df = merged_df.assign(revenue=merged_df['price'] * merged_df['units'])



In [451]:
summary = merged_df.groupby('product_id').agg(
    total_revenue = ('revenue', 'sum'),
    total_units = ('units', 'sum')
).reset_index()
summary

Unnamed: 0,product_id,total_revenue,total_units
0,1,800.0,115.0
1,2,3900.0,230.0


In [453]:
summary['average_price'] = summary['total_revenue'] / summary['total_units']
summary = summary.fillna(0).round(2)

In [454]:
summary

Unnamed: 0,product_id,total_revenue,total_units,average_price
0,1,800.0,115.0,6.96
1,2,3900.0,230.0,16.96


In [456]:
import pandas as pd

# Data for the Students table
students_data = {
    'student_id': [1, 2, 13, 6],
    'student_name': ['Alice', 'Bob', 'John', 'Alex']
}

# Create the Students DataFrame
students = pd.DataFrame(students_data)

print("Students DataFrame:")
print(students)

Students DataFrame:
   student_id student_name
0           1        Alice
1           2          Bob
2          13         John
3           6         Alex


In [457]:
import pandas as pd

# Data for the Subjects table
subjects_data = {
    'subject_name': ['Math', 'Physics', 'Programming']
}

# Create the Subjects DataFrame
subjects = pd.DataFrame(subjects_data)

print("Subjects DataFrame:")
print(subjects)

Subjects DataFrame:
  subject_name
0         Math
1      Physics
2  Programming


In [458]:
import pandas as pd

# Data for the Examinations table
examinations_data = {
    'student_id': [1, 1, 1, 2, 1, 1, 13, 13, 13, 2, 1],
    'subject_name': ['Math', 'Physics', 'Programming', 'Programming', 'Physics', 'Math', 
                     'Math', 'Programming', 'Physics', 'Math', 'Math']
}

# Create the Examinations DataFrame
examinations = pd.DataFrame(examinations_data)

print("Examinations DataFrame:")
print(examinations)

Examinations DataFrame:
    student_id subject_name
0            1         Math
1            1      Physics
2            1  Programming
3            2  Programming
4            1      Physics
5            1         Math
6           13         Math
7           13  Programming
8           13      Physics
9            2         Math
10           1         Math


In [475]:
students_subjects = students.merge(subjects, how='cross').sort_values('student_id')
students_subjects

Unnamed: 0,student_id,student_name,subject_name
0,1,Alice,Math
1,1,Alice,Physics
2,1,Alice,Programming
3,2,Bob,Math
4,2,Bob,Physics
5,2,Bob,Programming
9,6,Alex,Math
10,6,Alex,Physics
11,6,Alex,Programming
6,13,John,Math


In [478]:
examinations

Unnamed: 0,student_id,subject_name
0,1,Math
1,1,Physics
2,1,Programming
3,2,Programming
4,1,Physics
5,1,Math
6,13,Math
7,13,Programming
8,13,Physics
9,2,Math


In [496]:
attended_exams = examinations.groupby(['student_id', 'subject_name']).size().reset_index(name='attended_exams')

In [None]:
attended_exams

Unnamed: 0,student_id,subject_name,attended_exams
0,1,Math,3
1,1,Physics,2
2,1,Programming,1
3,2,Math,1
4,2,Programming,1
5,13,Math,1
6,13,Physics,1
7,13,Programming,1


In [499]:
result = students_subjects.merge(attended_exams, on=['student_id', 'subject_name'], how='left')
result['attended_exams'] = result['attended_exams'].fillna(0).astype(int)
result

Unnamed: 0,student_id,student_name,subject_name,attended_exams
0,1,Alice,Math,3
1,1,Alice,Physics,2
2,1,Alice,Programming,1
3,2,Bob,Math,1
4,2,Bob,Physics,0
5,2,Bob,Programming,1
6,6,Alex,Math,0
7,6,Alex,Physics,0
8,6,Alex,Programming,0
9,13,John,Math,1


In [525]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 3],
    'name': ['Jhon', 'Daniel', 'Jade', 'Khaled', 'Winston', 'Elvis', 'Anna', 'Maria', 'Jaze', 'Jhon', 'Jade'],
    'visited_on': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05', 
                   '2019-01-06', '2019-01-07', '2019-01-08', '2019-01-09', '2019-01-10', '2019-01-10'],
    'amount': [100, 110, 120, 130, 110, 140, 150, 80, 110, 130, 150]
}

# Create the DataFrame
customer = pd.DataFrame(data)

# Convert the date column to a proper datetime format
customer['visited_on'] = pd.to_datetime(customer['visited_on'])

# Display the DataFrame
print(customer)

    customer_id     name visited_on  amount
0             1     Jhon 2019-01-01     100
1             2   Daniel 2019-01-02     110
2             3     Jade 2019-01-03     120
3             4   Khaled 2019-01-04     130
4             5  Winston 2019-01-05     110
5             6    Elvis 2019-01-06     140
6             7     Anna 2019-01-07     150
7             8    Maria 2019-01-08      80
8             9     Jaze 2019-01-09     110
9             1     Jhon 2019-01-10     130
10            3     Jade 2019-01-10     150


In [526]:
daily_total = customer.groupby('visited_on')['amount'].sum().reset_index()
daily_total

Unnamed: 0,visited_on,amount
0,2019-01-01,100
1,2019-01-02,110
2,2019-01-03,120
3,2019-01-04,130
4,2019-01-05,110
5,2019-01-06,140
6,2019-01-07,150
7,2019-01-08,80
8,2019-01-09,110
9,2019-01-10,280


In [None]:
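# Replace each daily amount with the sum over the current day and the six days before it
daily_total['amount'] = daily_total['amount'].rolling(window=7, min_periods=7).sum()
daily_total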


Unnamed: 0,visited_on,amount
0,2019-01-01,
1,2019-01-02,
2,2019-01-03,
3,2019-01-04,
4,2019-01-05,
5,2019-01-06,
6,2019-01-07,860.0
7,2019-01-08,840.0
8,2019-01-09,840.0
9,2019-01-10,1000.0


In [528]:
daily_total['average_amount'] = (daily_total['amount'] / 7).round(2)

In [529]:
result = daily_total.dropna()

In [530]:
result

Unnamed: 0,visited_on,amount,average_amount
6,2019-01-07,860.0,122.86
7,2019-01-08,840.0,120.0
8,2019-01-09,840.0,120.0
9,2019-01-10,1000.0,142.86
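
The rolling computation above overwrites the `amount` column, so re-running the cell changes the result. The sketch below writes the rolling sum and mean to new columns instead. It re-creates the daily totals inline and assumes one row per consecutive calendar day, since the window is counted in rows; the names `daily`, `amount_7d`, and `weekly` are hypothetical.

```python
import pandas as pd

daily = pd.DataFrame({
    "visited_on": pd.date_range("2019-01-01", periods=10, freq="D"),
    "amount": [100, 110, 120, 130, 110, 140, 150, 80, 110, 280],
})

# Rolling window over the current day and the six days before it
window = daily["amount"].rolling(window=7, min_periods=7)

# Write to new columns so the original daily amounts are preserved
daily["amount_7d"] = window.sum()
daily["average_amount"] = window.mean().round(2)

# Rows without a full seven-day history are dropped
weekly = daily.dropna(subset=["amount_7d"]).reset_index(drop=True)
print(weekly)
```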


In [531]:
import pandas as pd

# Data for the Products table
products_data = {
    'product_id': [1, 2, 3, 4, 5],
    'product_name': ['Leetcode Solutions', 'Jewels of Stringology', 'HP', 'Lenovo', 'Leetcode Kit'],
    'product_category': ['Book', 'Book', 'Laptop', 'Laptop', 'T-shirt']
}

# Create the Products DataFrame
products = pd.DataFrame(products_data)

print("Products DataFrame:")
print(products)

Products DataFrame:
   product_id           product_name product_category
0           1     Leetcode Solutions             Book
1           2  Jewels of Stringology             Book
2           3                     HP           Laptop
3           4                 Lenovo           Laptop
4           5           Leetcode Kit          T-shirt


In [532]:
import pandas as pd

# Data for the Orders table
orders_data = {
    'product_id': [1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5],
    'order_date': ['2020-02-05', '2020-02-10', '2020-01-18', '2020-02-11', 
                   '2020-02-17', '2020-02-24', '2020-03-01', '2020-03-04', 
                   '2020-03-04', '2020-02-25', '2020-02-27', '2020-03-01'],
    'unit': [60, 70, 30, 80, 2, 3, 20, 30, 60, 50, 50, 50]
}

# Create the Orders DataFrame
orders = pd.DataFrame(orders_data)

# Convert the date column to a proper datetime format
orders['order_date'] = pd.to_datetime(orders['order_date'])

print("\nOrders DataFrame:")
print(orders)


Orders DataFrame:
    product_id order_date  unit
0            1 2020-02-05    60
1            1 2020-02-10    70
2            2 2020-01-18    30
3            2 2020-02-11    80
4            3 2020-02-17     2
5            3 2020-02-24     3
6            4 2020-03-01    20
7            4 2020-03-04    30
8            4 2020-03-04    60
9            5 2020-02-25    50
10           5 2020-02-27    50
11           5 2020-03-01    50


In [533]:
orders_feb2020 = orders[orders['order_date'].between('2020-02-01', '2020-02-29')]
orders_feb2020

Unnamed: 0,product_id,order_date,unit
0,1,2020-02-05,60
1,1,2020-02-10,70
3,2,2020-02-11,80
4,3,2020-02-17,2
5,3,2020-02-24,3
9,5,2020-02-25,50
10,5,2020-02-27,50


In [535]:
merged = products.merge(orders_feb2020, on='product_id', how='left')
merged

Unnamed: 0,product_id,product_name,product_category,order_date,unit
0,1,Leetcode Solutions,Book,2020-02-05,60.0
1,1,Leetcode Solutions,Book,2020-02-10,70.0
2,2,Jewels of Stringology,Book,2020-02-11,80.0
3,3,HP,Laptop,2020-02-17,2.0
4,3,HP,Laptop,2020-02-24,3.0
5,4,Lenovo,Laptop,NaT,
6,5,Leetcode Kit,T-shirt,2020-02-25,50.0
7,5,Leetcode Kit,T-shirt,2020-02-27,50.0


In [537]:
result = merged.groupby('product_name')['unit'].sum().reset_index()
result = result[result['unit'] >= 100] 
result

Unnamed: 0,product_name,unit
2,Leetcode Kit,100.0
3,Leetcode Solutions,130.0


In [538]:
# --- Create the DataFrames based on the problem description ---
movies_data = {'movie_id': [1, 2, 3], 'title': ['Avengers', 'Frozen 2', 'Joker']}
movies = pd.DataFrame(movies_data)

users_data = {'user_id': [1, 2, 3, 4], 'name': ['Daniel', 'Monica', 'Maria', 'James']}
users = pd.DataFrame(users_data)

ratings_data = {
    'movie_id': [1, 1, 1, 1, 2, 2, 2, 3, 3],
    'user_id': [1, 2, 3, 4, 1, 2, 3, 1, 2],
    'rating': [3, 4, 2, 1, 5, 2, 2, 3, 4],
    'created_at': pd.to_datetime(['2020-01-12', '2020-02-11', '2020-02-12', '2020-01-01', 
                                  '2020-02-17', '2020-02-01', '2020-03-01', '2020-02-22', '2020-02-25'])
}
movie_ratings = pd.DataFrame(ratings_data)


In [541]:
user_rating = movie_ratings['user_id'].value_counts().reset_index(name='rating_count')
user_rating

Unnamed: 0,user_id,rating_count
0,1,3
1,2,3
2,3,2
3,4,1


In [542]:
user_rating = user_rating.merge(users, on='user_id')
user_rating

Unnamed: 0,user_id,rating_count,name
0,1,3,Daniel
1,2,3,Monica
2,3,2,Maria
3,4,1,James


In [544]:
top_user = user_rating.sort_values(by=['rating_count', 'name'], ascending=[False, True]).iloc[0]
top_user.to_frame()

Unnamed: 0,0
user_id,1
rating_count,3
name,Daniel


In [545]:
feb_rating = movie_ratings[movie_ratings['created_at'].dt.to_period('M') == '2020-02']
feb_rating

Unnamed: 0,movie_id,user_id,rating,created_at
1,1,2,4,2020-02-11
2,1,3,2,2020-02-12
4,2,1,5,2020-02-17
5,2,2,2,2020-02-01
7,3,1,3,2020-02-22
8,3,2,4,2020-02-25


In [547]:
movie_avg_rating = feb_rating.groupby('movie_id')['rating'].mean().reset_index(name='avg_rating')
movie_avg_rating = movie_avg_rating.merge(movies, on='movie_id')
movie_avg_rating

Unnamed: 0,movie_id,avg_rating,title
0,1,3.0,Avengers
1,2,3.5,Frozen 2
2,3,3.5,Joker


In [548]:
top_movie = movie_avg_rating.sort_values(by=['avg_rating', 'title'], ascending=[False, True]).iloc[0]
top_movie

movie_id             2
avg_rating         3.5
title         Frozen 2
Name: 1, dtype: object

In [550]:
result = pd.DataFrame({'result' : [top_user['name'], top_movie['title']]})
result

Unnamed: 0,result
0,Daniel
1,Frozen 2


In [551]:
import pandas as pd

# Data for the Employees table
employees_data = {
    'id': [1, 7, 11, 90, 3],
    'name': ['Alice', 'Bob', 'Meir', 'Winston', 'Jonathan']
}

# Create the Employees DataFrame
employees = pd.DataFrame(employees_data)

print("Employees DataFrame:")
print(employees)

Employees DataFrame:
   id      name
0   1     Alice
1   7       Bob
2  11      Meir
3  90   Winston
4   3  Jonathan


In [556]:
import pandas as pd

# Data for the EmployeeUNI table
employee_uni_data = {
    'id': [3, 11, 90],
    'unique_id': [1, 2, 3]
}

# Create the EmployeeUNI DataFrame
employee_uni = pd.DataFrame(employee_uni_data)
employee_uni['unique_id'] = employee_uni['unique_id'].astype(int)

print("\nEmployeeUNI DataFrame:")
print(employee_uni)


EmployeeUNI DataFrame:
   id  unique_id
0   3          1
1  11          2
2  90          3


In [565]:
merged = employees.merge(employee_uni, on='id', how='left')
merged['unique_id'] = merged['unique_id'].astype('Int64')
merged

Unnamed: 0,id,name,unique_id
0,1,Alice,
1,7,Bob,
2,11,Meir,2.0
3,90,Winston,3.0
4,3,Jonathan,1.0


In [566]:
merged[['unique_id', 'name']]

Unnamed: 0,unique_id,name
0,,Alice
1,,Bob
2,2.0,Meir
3,3.0,Winston
4,1.0,Jonathan


In [575]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'stock_name': ['Leetcode', 'Corona Masks', 'Leetcode', 'Handbags', 'Corona Masks',
                   'Corona Masks', 'Corona Masks', 'Corona Masks', 'Handbags', 'Corona Masks'],
    'operation': ['Buy', 'Buy', 'Sell', 'Buy', 'Sell', 'Buy', 'Sell', 'Buy', 'Sell', 'Sell'],
    'operation_day': [1, 2, 5, 17, 3, 4, 5, 6, 29, 10],
    'price': [1000, 10, 9000, 30000, 1010, 1000, 500, 1000, 7000, 10000]
}

# Create the DataFrame
stocks = pd.DataFrame(data)

# Display the DataFrame
print(stocks)

     stock_name operation  operation_day  price
0      Leetcode       Buy              1   1000
1  Corona Masks       Buy              2     10
2      Leetcode      Sell              5   9000
3      Handbags       Buy             17  30000
4  Corona Masks      Sell              3   1010
5  Corona Masks       Buy              4   1000
6  Corona Masks      Sell              5    500
7  Corona Masks       Buy              6   1000
8      Handbags      Sell             29   7000
9  Corona Masks      Sell             10  10000


In [577]:
import numpy as np

# Buys are cash outflows (negative); sells are cash inflows (positive)
stocks['floating'] = np.where(stocks['operation'] == 'Buy', -stocks['price'], stocks['price'])
stocks

Unnamed: 0,stock_name,operation,operation_day,price,floating
0,Leetcode,Buy,1,1000,-1000
1,Corona Masks,Buy,2,10,-10
2,Leetcode,Sell,5,9000,9000
3,Handbags,Buy,17,30000,-30000
4,Corona Masks,Sell,3,1010,1010
5,Corona Masks,Buy,4,1000,-1000
6,Corona Masks,Sell,5,500,500
7,Corona Masks,Buy,6,1000,-1000
8,Handbags,Sell,29,7000,7000
9,Corona Masks,Sell,10,10000,10000


In [580]:
result = stocks.groupby('stock_name')['floating'].sum().reset_index()
result = result.rename(columns={'floating': 'capital_gain_loss'})
result

Unnamed: 0,stock_name,capital_gain_loss
0,Corona Masks,9500
1,Handbags,-23000
2,Leetcode,8000
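
As a possible alternative to `np.where`, the sign of each operation can be looked up with `Series.map`, which avoids the NumPy dependency. This is only a sketch under the assumption that `operation` contains nothing but `'Buy'` and `'Sell'`; the name `sign` is introduced here for illustration, and `operation_day` is omitted because the calculation does not use it.

```python
import pandas as pd

stocks = pd.DataFrame({
    'stock_name': ['Leetcode', 'Corona Masks', 'Leetcode', 'Handbags', 'Corona Masks',
                   'Corona Masks', 'Corona Masks', 'Corona Masks', 'Handbags', 'Corona Masks'],
    'operation': ['Buy', 'Buy', 'Sell', 'Buy', 'Sell', 'Buy', 'Sell', 'Buy', 'Sell', 'Sell'],
    'price': [1000, 10, 9000, 30000, 1010, 1000, 500, 1000, 7000, 10000],
})

# Buy -> -1 (money out), Sell -> +1 (money in)
sign = stocks['operation'].map({'Buy': -1, 'Sell': 1})

# Signed price per operation, summed per stock
result = (
    stocks.assign(capital_gain_loss=stocks['price'] * sign)
    .groupby('stock_name', as_index=False)['capital_gain_loss']
    .sum()
)
print(result)
```

Any operation value missing from the mapping would become `NaN`, so unexpected labels show up in the result instead of being silently treated as sells.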


In [42]:
import pandas as pd

# Data for the Users table
users_data = {
    'id': [1, 2, 3, 4, 7, 13, 19],
    'name': ['Alice', 'Bob', 'Alex', 'Donald', 'Lee', 'Jonathan', 'Elvis']
}

# Create the Users DataFrame
users = pd.DataFrame(users_data)

print("Users DataFrame:")
print(users)

Users DataFrame:
   id      name
0   1     Alice
1   2       Bob
2   3      Alex
3   4    Donald
4   7       Lee
5  13  Jonathan
6  19     Elvis


In [43]:
import pandas as pd

# Data for the Rides table
rides_data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'user_id': [1, 2, 3, 7, 13, 19, 7, 19, 7],
    'distance': [120, 317, 222, 100, 312, 50, 120, 400, 230]
}

# Create the Rides DataFrame
rides = pd.DataFrame(rides_data)

print("\nRides DataFrame:")
print(rides)


Rides DataFrame:
   id  user_id  distance
0   1        1       120
1   2        2       317
2   3        3       222
3   4        7       100
4   5       13       312
5   6       19        50
6   7        7       120
7   8       19       400
8   9        7       230


In [44]:
merged = users.merge(rides, left_on='id', right_on='user_id', how='left')
merged

Unnamed: 0,id_x,name,id_y,user_id,distance
0,1,Alice,1.0,1.0,120.0
1,2,Bob,2.0,2.0,317.0
2,3,Alex,3.0,3.0,222.0
3,4,Donald,,,
4,7,Lee,4.0,7.0,100.0
5,7,Lee,7.0,7.0,120.0
6,7,Lee,9.0,7.0,230.0
7,13,Jonathan,5.0,13.0,312.0
8,19,Elvis,6.0,19.0,50.0
9,19,Elvis,8.0,19.0,400.0


In [45]:
result = merged.groupby(['id_x', 'name'])['distance'].sum().reset_index()
result

Unnamed: 0,id_x,name,distance
0,1,Alice,120.0
1,2,Bob,317.0
2,3,Alex,222.0
3,4,Donald,0.0
4,7,Lee,450.0
5,13,Jonathan,312.0
6,19,Elvis,450.0


In [1]:
import pandas as pd

# Data for the Visits table
visits_data = {
    'visit_id': [1, 2, 4, 5, 6, 7, 8],
    'customer_id': [23, 9, 30, 54, 96, 54, 54]
}

# Create the Visits DataFrame
visits = pd.DataFrame(visits_data)

print("Visits DataFrame:")
print(visits)

Visits DataFrame:
   visit_id  customer_id
0         1           23
1         2            9
2         4           30
3         5           54
4         6           96
5         7           54
6         8           54


In [2]:
import pandas as pd

# Data for the Transactions table
transactions_data = {
    'transaction_id': [2, 3, 9, 12, 13],
    'visit_id': [5, 5, 5, 1, 2],
    'amount': [310, 300, 200, 910, 970]
}

# Create the Transactions DataFrame
transactions = pd.DataFrame(transactions_data)

print("\nTransactions DataFrame:")
print(transactions)


Transactions DataFrame:
   transaction_id  visit_id  amount
0               2         5     310
1               3         5     300
2               9         5     200
3              12         1     910
4              13         2     970


In [3]:
merged = visits.merge(transactions, on='visit_id', how='left')
merged

Unnamed: 0,visit_id,customer_id,transaction_id,amount
0,1,23,12.0,910.0
1,2,9,13.0,970.0
2,4,30,,
3,5,54,2.0,310.0
4,5,54,3.0,300.0
5,5,54,9.0,200.0
6,6,96,,
7,7,54,,
8,8,54,,


In [8]:
result = merged[merged['transaction_id'].isnull()]
result = result['customer_id'].value_counts().reset_index(name='count_no_trans')
result

Unnamed: 0,customer_id,count_no_trans
0,54,2
1,30,1
2,96,1
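
The same answer can be reached without a merge by filtering `visits` with `isin`, which acts as an anti-join. The sketch below is an alternative to the cells above, not a replacement for them; `no_trans` is a name introduced here for illustration.

```python
import pandas as pd

visits = pd.DataFrame({
    'visit_id': [1, 2, 4, 5, 6, 7, 8],
    'customer_id': [23, 9, 30, 54, 96, 54, 54],
})
transactions = pd.DataFrame({
    'transaction_id': [2, 3, 9, 12, 13],
    'visit_id': [5, 5, 5, 1, 2],
    'amount': [310, 300, 200, 910, 970],
})

# Keep only the visits whose visit_id never appears in transactions
no_trans = visits[~visits['visit_id'].isin(transactions['visit_id'])]

# Count those visits per customer
result = no_trans.groupby('customer_id').size().reset_index(name='count_no_trans')
print(result)
```

Unlike `value_counts`, `groupby(...).size()` returns rows ordered by `customer_id` rather than by count, so the row order differs from the output above while the counts are the same.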


In [9]:
import pandas as pd

# Data for the Users table
users_data = {
    'user_id': [6, 2, 7],
    'user_name': ['Alice', 'Bob', 'Alex']
}

# Create the Users DataFrame
users = pd.DataFrame(users_data)

print("Users DataFrame:")
print(users)

Users DataFrame:
   user_id user_name
0        6     Alice
1        2       Bob
2        7      Alex


In [10]:
import pandas as pd

# Data for the Register table
register_data = {
    'contest_id': [215, 209, 208, 210, 208, 209, 209, 215, 208, 210, 207, 210],
    'user_id': [6, 2, 2, 6, 6, 7, 6, 7, 7, 2, 2, 7]
}

# Create the Register DataFrame
register = pd.DataFrame(register_data)

print("\nRegister DataFrame:")
print(register)


Register DataFrame:
    contest_id  user_id
0          215        6
1          209        2
2          208        2
3          210        6
4          208        6
5          209        7
6          209        6
7          215        7
8          208        7
9          210        2
10         207        2
11         210        7


In [25]:
registered = register.groupby('contest_id')['user_id'].count().reset_index(name='register_count')
registered

Unnamed: 0,contest_id,register_count
0,207,1
1,208,3
2,209,3
3,210,3
4,215,2


In [15]:
total_users = users['user_id'].count()
total_users

np.int64(3)

In [22]:
registered['percentage'] = (registered['register_count'] * 100 / total_users).round(2)
registered

Unnamed: 0,contest_id,register_count,percentage
0,207,1,33.33
1,208,3,100.0
2,209,3,100.0
3,210,3,100.0
4,215,2,66.67


In [24]:
result = registered[['contest_id', 'percentage']].sort_values(by=['percentage', 'contest_id'], ascending=[False, True])
result

Unnamed: 0,contest_id,percentage
1,208,100.0
2,209,100.0
3,210,100.0
4,215,66.67
0,207,33.33


In [26]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'machine_id': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    'process_id': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    'activity_type': ['start', 'end', 'start', 'end', 'start', 'end', 'start', 'end', 'start', 'end', 'start', 'end'],
    'timestamp': [0.712, 1.520, 3.140, 4.120, 0.550, 1.550, 0.430, 1.420, 4.100, 4.512, 2.500, 5.000]
}

# Create the DataFrame
activity = pd.DataFrame(data)

# Display the DataFrame
print(activity)

    machine_id  process_id activity_type  timestamp
0            0           0         start      0.712
1            0           0           end      1.520
2            0           1         start      3.140
3            0           1           end      4.120
4            1           0         start      0.550
5            1           0           end      1.550
6            1           1         start      0.430
7            1           1           end      1.420
8            2           0         start      4.100
9            2           0           end      4.512
10           2           1         start      2.500
11           2           1           end      5.000


In [28]:
start_type = activity[activity['activity_type'] == 'start']
start_type

Unnamed: 0,machine_id,process_id,activity_type,timestamp
0,0,0,start,0.712
2,0,1,start,3.14
4,1,0,start,0.55
6,1,1,start,0.43
8,2,0,start,4.1
10,2,1,start,2.5


In [29]:
end_type = activity[activity['activity_type'] == 'end']
end_type

Unnamed: 0,machine_id,process_id,activity_type,timestamp
1,0,0,end,1.52
3,0,1,end,4.12
5,1,0,end,1.55
7,1,1,end,1.42
9,2,0,end,4.512
11,2,1,end,5.0


In [36]:
summary = start_type.merge(end_type, on=['machine_id', 'process_id'])
summary['difference'] = summary['timestamp_y'] - summary['timestamp_x']
summary

Unnamed: 0,machine_id,process_id,activity_type_x,timestamp_x,activity_type_y,timestamp_y,difference
0,0,0,start,0.712,end,1.52,0.808
1,0,1,start,3.14,end,4.12,0.98
2,1,0,start,0.55,end,1.55,1.0
3,1,1,start,0.43,end,1.42,0.99
4,2,0,start,4.1,end,4.512,0.412
5,2,1,start,2.5,end,5.0,2.5


In [37]:
total_process = summary.groupby('machine_id')['process_id'].count().reset_index(name='process_count')
total_process

Unnamed: 0,machine_id,process_count
0,0,2
1,1,2
2,2,2


In [40]:
result = summary.merge(total_process, on='machine_id')
result = result.groupby(['machine_id','process_count'])['difference'].sum().reset_index()
result['processing_time'] = (result['difference'] / result['process_count']).round(3)
result

Unnamed: 0,machine_id,process_count,difference,processing_time
0,0,2,1.788,0.894
1,1,2,1.99,0.995
2,2,2,2.912,1.456
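
The split, merge, and divide steps above can be shortened by pivoting the `start` and `end` timestamps into columns. This is a hedged sketch of that alternative: it assumes every (machine, process) pair has exactly one `start` and one `end` row, and a pandas version in which `DataFrame.pivot` accepts a list for `index` (1.1 or later). The name `wide` is introduced here for illustration.

```python
import pandas as pd

activity = pd.DataFrame({
    'machine_id': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    'process_id': [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    'activity_type': ['start', 'end', 'start', 'end', 'start', 'end',
                      'start', 'end', 'start', 'end', 'start', 'end'],
    'timestamp': [0.712, 1.520, 3.140, 4.120, 0.550, 1.550,
                  0.430, 1.420, 4.100, 4.512, 2.500, 5.000],
})

# One row per (machine, process), with 'start' and 'end' as columns
wide = activity.pivot(index=['machine_id', 'process_id'],
                      columns='activity_type', values='timestamp')

# Duration of each process, averaged per machine
result = (
    (wide['end'] - wide['start'])
    .groupby(level='machine_id')
    .mean()
    .round(3)
    .reset_index(name='processing_time')
)
print(result)
```

Taking the mean directly removes the need for the separate `process_count` column.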


In [48]:
import pandas as pd
import numpy as np

# Data from the table stored in a dictionary
# We use np.nan to represent the null values
data = {
    'employee_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['Michael', 'Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'reports_to': [np.nan, 1, 1, 2, 2, 3, np.nan, np.nan],
    'age': [45, 38, 42, 34, 40, 37, 50, 48]
}

# Create the DataFrame
employees = pd.DataFrame(data)

# Display the DataFrame
print(employees)

   employee_id     name  reports_to  age
0            1  Michael         NaN   45
1            2    Alice         1.0   38
2            3      Bob         1.0   42
3            4  Charlie         2.0   34
4            5    David         2.0   40
5            6      Eve         3.0   37
6            7    Frank         NaN   50
7            8    Grace         NaN   48


In [52]:
merged = employees.merge(employees, left_on='reports_to', right_on='employee_id', how='left', suffixes=('_employee', '_manager'))
merged

Unnamed: 0,employee_id_employee,name_employee,reports_to_employee,age_employee,employee_id_manager,name_manager,reports_to_manager,age_manager
0,1,Michael,,45,,,,
1,2,Alice,1.0,38,1.0,Michael,,45.0
2,3,Bob,1.0,42,1.0,Michael,,45.0
3,4,Charlie,2.0,34,2.0,Alice,1.0,38.0
4,5,David,2.0,40,2.0,Alice,1.0,38.0
5,6,Eve,3.0,37,3.0,Bob,1.0,42.0
6,7,Frank,,50,,,,
7,8,Grace,,48,,,,


In [53]:
result = merged[['employee_id_manager', 'name_manager', 'reports_to_employee', 'age_employee']]
result

Unnamed: 0,employee_id_manager,name_manager,reports_to_employee,age_employee
0,,,,45
1,1.0,Michael,1.0,38
2,1.0,Michael,1.0,42
3,2.0,Alice,2.0,34
4,2.0,Alice,2.0,40
5,3.0,Bob,3.0,37
6,,,,50
7,,,,48


In [54]:
result = result.groupby(['employee_id_manager', 'name_manager']).agg(
    reports_count = ('reports_to_employee', 'count'),
    average_age = ('age_employee', 'mean')
).reset_index()
result

Unnamed: 0,employee_id_manager,name_manager,reports_count,average_age
0,1.0,Michael,2,40.0
1,2.0,Alice,2,37.0
2,3.0,Bob,1,37.0


In [55]:
result = result.rename(columns={'employee_id_manager' : 'employee_id',
                                'name_manager' : 'name'})
result

Unnamed: 0,employee_id,name,reports_count,average_age
0,1.0,Michael,2,40.0
1,2.0,Alice,2,37.0
2,3.0,Bob,1,37.0


In [56]:
result['average_age'] = result['average_age'].round()

In [57]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'product_id': [0, 1, 2, 3, 4],
    'low_fats': ['Y', 'Y', 'N', 'Y', 'N'],
    'recyclable': ['N', 'Y', 'Y', 'Y', 'N']
}

# Create the DataFrame
products = pd.DataFrame(data)

# Display the DataFrame
print(products)

   product_id low_fats recyclable
0           0        Y          N
1           1        Y          Y
2           2        N          Y
3           3        Y          Y
4           4        N          N


In [61]:
conditions = (products['low_fats'] == 'Y') & (products['recyclable'] == 'Y')
result = products.loc[conditions, ['product_id']]
result

Unnamed: 0,product_id
1,1
3,3


In [62]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'employee_id': [1, 2, 2, 3, 4, 4, 4],
    'department_id': [1, 1, 2, 3, 2, 3, 4],
    'primary_flag': ['N', 'Y', 'N', 'N', 'N', 'Y', 'N']
}

# Create the DataFrame
employee = pd.DataFrame(data)

# Display the DataFrame
print(employee)

   employee_id  department_id primary_flag
0            1              1            N
1            2              1            Y
2            2              2            N
3            3              3            N
4            4              2            N
5            4              3            Y
6            4              4            N


In [66]:
one_department = employee.groupby('employee_id')['department_id'].nunique().reset_index(name='dept_count')
one_department = one_department[one_department['dept_count'] == 1]['employee_id']
result = employee[(employee['primary_flag'] == 'Y') | (employee['employee_id'].isin(one_department))][['employee_id', 'department_id']]
result

Unnamed: 0,employee_id,department_id
0,1,1
1,2,1
3,3,3
5,4,3


In [64]:
one_department

0    1
2    3
Name: employee_id, dtype: int64


In [1]:
import pandas as pd

# Data for the Signups table
signups_data = {
    'user_id': [3, 7, 2, 6],
    'time_stamp': ['2020-03-21 10:16:13', '2020-01-04 13:57:59', 
                   '2020-07-29 23:09:44', '2020-12-09 10:39:37']
}

# Create the Signups DataFrame
signups = pd.DataFrame(signups_data)

# Convert the time_stamp column to a proper datetime format
signups['time_stamp'] = pd.to_datetime(signups['time_stamp'])

print("Signups DataFrame:")
print(signups)

Signups DataFrame:
   user_id          time_stamp
0        3 2020-03-21 10:16:13
1        7 2020-01-04 13:57:59
2        2 2020-07-29 23:09:44
3        6 2020-12-09 10:39:37


In [15]:
import pandas as pd

# Data for the Confirmations table
confirmations_data = {
    'user_id': [3, 3, 7, 7, 7, 2, 2],
    'time_stamp': ['2021-01-06 03:30:46', '2021-07-14 14:00:00', 
                   '2021-06-12 11:57:29', '2021-06-13 12:58:28', 
                   '2021-06-14 13:59:27', '2021-01-22 00:00:00', 
                   '2021-02-28 23:59:59'],
    'action': ['timeout', 'timeout', 'confirmed', 'confirmed', 'confirmed', 'confirmed', 'timeout']
}

# Create the Confirmations DataFrame
confirmations = pd.DataFrame(confirmations_data)

# Convert the time_stamp column to a proper datetime format
confirmations['time_stamp'] = pd.to_datetime(confirmations['time_stamp'])

print("\nConfirmations DataFrame:")
print(confirmations)


Confirmations DataFrame:
   user_id          time_stamp     action
0        3 2021-01-06 03:30:46    timeout
1        3 2021-07-14 14:00:00    timeout
2        7 2021-06-12 11:57:29  confirmed
3        7 2021-06-13 12:58:28  confirmed
4        7 2021-06-14 13:59:27  confirmed
5        2 2021-01-22 00:00:00  confirmed
6        2 2021-02-28 23:59:59    timeout


In [16]:
import numpy as np

In [17]:
merged = signups.merge(confirmations, on='user_id', how='left', suffixes=('_signup', '_confirmation'))
merged

Unnamed: 0,user_id,time_stamp_signup,time_stamp_confirmation,action
0,3,2020-03-21 10:16:13,2021-01-06 03:30:46,timeout
1,3,2020-03-21 10:16:13,2021-07-14 14:00:00,timeout
2,7,2020-01-04 13:57:59,2021-06-12 11:57:29,confirmed
3,7,2020-01-04 13:57:59,2021-06-13 12:58:28,confirmed
4,7,2020-01-04 13:57:59,2021-06-14 13:59:27,confirmed
5,2,2020-07-29 23:09:44,2021-01-22 00:00:00,confirmed
6,2,2020-07-29 23:09:44,2021-02-28 23:59:59,timeout
7,6,2020-12-09 10:39:37,NaT,


In [18]:
merged['confirmed'] = np.where(merged['action'] == 'confirmed', 1, 0)
merged['actions'] = np.where(merged['action'].isnull(), 0, 1)
merged = merged.groupby('user_id').agg(
    total_actions = ('actions', 'sum'),
    total_confirmed = ('confirmed', 'sum')
).reset_index()
merged

Unnamed: 0,user_id,total_actions,total_confirmed
0,2,2,1
1,3,2,0
2,6,0,0
3,7,3,3


In [20]:
merged['confirmation_rate'] = (merged['total_confirmed']/merged['total_actions']).round(2).fillna(0)
merged

Unnamed: 0,user_id,total_actions,total_confirmed,confirmation_rate
0,2,2,1,0.5
1,3,2,0,0.0
2,6,0,0,0.0
3,7,3,3,1.0
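
Because the mean of a boolean column is the share of `True` values, the confirmation rate can also be computed without the helper columns. The sketch below is one possible shortcut, not the notebook's original approach; it keeps only the `user_id` and `action` columns because the timestamps do not affect the rate, and `rate` is a name introduced here for illustration.

```python
import pandas as pd

signups = pd.DataFrame({'user_id': [3, 7, 2, 6]})
confirmations = pd.DataFrame({
    'user_id': [3, 3, 7, 7, 7, 2, 2],
    'action': ['timeout', 'timeout', 'confirmed', 'confirmed',
               'confirmed', 'confirmed', 'timeout'],
})

# Share of 'confirmed' actions per user (mean of a boolean Series)
rate = confirmations['action'].eq('confirmed').groupby(confirmations['user_id']).mean()

# Users with no confirmation requests are absent from rate, so they map to NaN -> 0
result = signups[['user_id']].copy()
result['confirmation_rate'] = result['user_id'].map(rate).fillna(0).round(2)
print(result)
```

The rows follow the order of `signups` here, whereas the `groupby` in the cells above sorts by `user_id`.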


In [21]:
import pandas as pd

# Data for the Employees table
employees_data = {
    'employee_id': [2, 4, 5],
    'name': ['Crew', 'Haven', 'Kristian']
}

# Create the Employees DataFrame
employees = pd.DataFrame(employees_data)

print("Employees DataFrame:")
print(employees)

Employees DataFrame:
   employee_id      name
0            2      Crew
1            4     Haven
2            5  Kristian


In [22]:
import pandas as pd

# Data for the Salaries table
salaries_data = {
    'employee_id': [5, 1, 4],
    'salary': [76071, 22517, 63539]
}

# Create the Salaries DataFrame
salaries = pd.DataFrame(salaries_data)

print("\nSalaries DataFrame:")
print(salaries)


Salaries DataFrame:
   employee_id  salary
0            5   76071
1            1   22517
2            4   63539


In [27]:
missing_data1 = employees[~employees['employee_id'].isin(salaries['employee_id'])]
missing_data1

Unnamed: 0,employee_id,name
0,2,Crew


In [24]:
missing_data2 = salaries[~salaries['employee_id'].isin(employees['employee_id'])]
missing_data2

Unnamed: 0,employee_id,salary
1,1,22517


In [None]:
# Combine the IDs missing from either table, then sort them
missing_ids = missing_data1['employee_id'].tolist() + missing_data2['employee_id'].tolist()
result = (
    pd.DataFrame({'employee_id': missing_ids})
    .sort_values(by='employee_id')
    .reset_index(drop=True)
)
result


Unnamed: 0,employee_id
0,1
1,2


In [32]:
data = {
    'student_id': [1, 1, 1, 2, 2, 1, 1],
    'subject': ['Math', 'Math', 'Math', 'Physics', 'Physics', 'Physics', 'Physics'],
    'exam_date': pd.to_datetime(['2025-01-01', '2025-02-01', '2025-03-01', 
                                '2025-01-15', '2025-02-15', '2025-01-10', '2025-02-10']),
    'score': [80, 70, 90, 60, 75, 90, 80]
}
scores = pd.DataFrame(data)

In [33]:
sorted_scores = scores.sort_values(by='exam_date')

grouped = sorted_scores.groupby(['student_id', 'subject'])

first_score = grouped['score'].first()
latest_score = grouped['score'].last()

summary = pd.DataFrame({
    'first_score': first_score,
    'latest_score': latest_score
}).reset_index()

In [34]:
summary

Unnamed: 0,student_id,subject,first_score,latest_score
0,1,Math,80,90
1,1,Physics,90,80
2,2,Physics,60,75


In [36]:
result = summary[summary['latest_score'] > summary['first_score']].sort_values(by=['student_id', 'subject']).reset_index(drop=True)

result

Unnamed: 0,student_id,subject,first_score,latest_score
0,1,Math,80,90
1,2,Physics,60,75


In [13]:
import pandas as pd

# Data for the Employee table
employee_data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Joe', 'Jim', 'Henry', 'Sam', 'Max'],
    'salary': [70000, 90000, 80000, 60000, 90000],
    'departmentId': [1, 1, 2, 2, 1]
}

# Create the Employee DataFrame
employee = pd.DataFrame(employee_data)

print("Employee DataFrame:")
print(employee)

Employee DataFrame:
   id   name  salary  departmentId
0   1    Joe   70000             1
1   2    Jim   90000             1
2   3  Henry   80000             2
3   4    Sam   60000             2
4   5    Max   90000             1


In [14]:
import pandas as pd

# Data for the Department table
department_data = {
    'id': [1, 2],
    'name': ['IT', 'Sales']
}

# Create the Department DataFrame
department = pd.DataFrame(department_data)

print("\nDepartment DataFrame:")
print(department)


Department DataFrame:
   id   name
0   1     IT
1   2  Sales


In [16]:
merged = employee.merge(department, left_on='departmentId', right_on='id', how='left')
merged

Unnamed: 0,id_x,name_x,salary,departmentId,id_y,name_y
0,1,Joe,70000,1,1,IT
1,2,Jim,90000,1,1,IT
2,3,Henry,80000,2,2,Sales
3,4,Sam,60000,2,2,Sales
4,5,Max,90000,1,1,IT


In [18]:
highest_salary = merged.loc[merged.groupby('departmentId')['salary'].transform('max') == merged['salary']]
highest_salary

Unnamed: 0,id_x,name_x,salary,departmentId,id_y,name_y
1,2,Jim,90000,1,1,IT
2,3,Henry,80000,2,2,Sales
4,5,Max,90000,1,1,IT


In [8]:
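# Attach each department's top salary to every row, then keep the rows that match it
merged['highest_salary'] = merged.groupby('departmentId')['salary'].transform('max')
highest_salary = merged[merged['salary'] == merged['highest_salary']]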
highest_salary

Unnamed: 0,id_x,name_x,salary,departmentId,id_y,name_y,highest_salary
1,2,Jim,90000,1,1,IT,90000
2,3,Henry,80000,2,2,Sales,80000
4,5,Max,90000,1,1,IT,90000


In [12]:
result = highest_salary[['name_y', 'name_x', 'salary']].rename(columns={
    'name_y' : 'Department',
    'name_x' : 'Employee',
    'salary' : 'Salary'
}).sort_values(by='Salary', ascending=False).reset_index(drop=True)

result

Unnamed: 0,Department,Employee,Salary
0,IT,Jim,90000
1,IT,Max,90000
2,Sales,Henry,80000


In [19]:
import pandas as pd

# Data for the Employee table
employee_data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Joe', 'Jim', 'Henry', 'Sam', 'Max'],
    'salary': [70000, 90000, 80000, 60000, 90000],
    'departmentId': [1, 1, 2, 2, 1]
}

# Create the Employee DataFrame
employee = pd.DataFrame(employee_data)

print("Employee DataFrame:")
print(employee)

Employee DataFrame:
   id   name  salary  departmentId
0   1    Joe   70000             1
1   2    Jim   90000             1
2   3  Henry   80000             2
3   4    Sam   60000             2
4   5    Max   90000             1


In [20]:
import pandas as pd

# Data for the Department table
department_data = {
    'id': [1, 2],
    'name': ['IT', 'Sales']
}

# Create the Department DataFrame
department = pd.DataFrame(department_data)

print("\nDepartment DataFrame:")
print(department)


Department DataFrame:
   id   name
0   1     IT
1   2  Sales


In [21]:
merged = employee.merge(department, left_on='departmentId', right_on='id')
merged

Unnamed: 0,id_x,name_x,salary,departmentId,id_y,name_y
0,1,Joe,70000,1,1,IT
1,2,Jim,90000,1,1,IT
2,3,Henry,80000,2,2,Sales
3,4,Sam,60000,2,2,Sales
4,5,Max,90000,1,1,IT


In [23]:
merged['rank'] = merged.groupby('departmentId')['salary'].rank(method='dense', ascending=False)
high_earners = merged[merged['rank'] <= 3]
high_earners

Unnamed: 0,id_x,name_x,salary,departmentId,id_y,name_y,rank
0,1,Joe,70000,1,1,IT,2.0
1,2,Jim,90000,1,1,IT,1.0
2,3,Henry,80000,2,2,Sales,1.0
3,4,Sam,60000,2,2,Sales,2.0
4,5,Max,90000,1,1,IT,1.0
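
The filter `rank <= 3` above depends on `method='dense'`. A small illustration, using only the three IT salaries from the table, shows how dense ranking differs from `method='min'` when salaries tie. It is a sketch for comparison only, and the variable names are introduced here.

```python
import pandas as pd

# The three IT salaries from the Employee table
salary = pd.Series([70000, 90000, 90000], index=['Joe', 'Jim', 'Max'])

# dense: tied values share a rank and the next distinct value gets the next integer
dense = salary.rank(method='dense', ascending=False)

# min: tied values share the lowest rank, and the ranks after a tie are skipped
minimum = salary.rank(method='min', ascending=False)

print(pd.DataFrame({'salary': salary, 'dense': dense, 'min': minimum}))
```

With `dense`, Joe ranks 2 because his salary is the second distinct level; with `min` he ranks 3. A cutoff on the dense rank therefore selects the top three distinct salary levels per department rather than the top three rows.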


In [30]:
import pandas as pd

# Data for the Person table
data = {
    'id': [1, 2, 3],
    'email': ['john@example.com', 'bob@example.com', 'john@example.com']
}

# Create the Person DataFrame
person = pd.DataFrame(data)

print("Person DataFrame:")
print(person)

Person DataFrame:
   id             email
0   1  john@example.com
1   2   bob@example.com
2   3  john@example.com


In [31]:
person = person.drop_duplicates(subset='email')
person

Unnamed: 0,id,email
0,1,john@example.com
1,2,bob@example.com


In [42]:
import pandas as pd

# Data for the Trips table
trips_data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'client_id': [1, 2, 3, 4, 1, 2, 3, 2, 3, 4],
    'driver_id': [10, 11, 12, 13, 10, 11, 12, 12, 10, 13],
    'city_id': [1, 1, 6, 6, 1, 6, 6, 12, 12, 12],
    'status': ['completed', 'cancelled_by_driver', 'completed', 'cancelled_by_client',
               'completed', 'completed', 'completed', 'completed', 'completed', 'cancelled_by_driver'],
    'request_at': ['2013-10-01', '2013-10-01', '2013-10-01', '2013-10-01', '2013-10-02',
                   '2013-10-02', '2013-10-02', '2013-10-03', '2013-10-03', '2013-10-03']
}

# Create the Trips DataFrame
trips = pd.DataFrame(trips_data)

# Convert date column to proper datetime format
trips['request_at'] = pd.to_datetime(trips['request_at'])

print("Trips DataFrame:")
print(trips)

Trips DataFrame:
   id  client_id  driver_id  city_id               status request_at
0   1          1         10        1            completed 2013-10-01
1   2          2         11        1  cancelled_by_driver 2013-10-01
2   3          3         12        6            completed 2013-10-01
3   4          4         13        6  cancelled_by_client 2013-10-01
4   5          1         10        1            completed 2013-10-02
5   6          2         11        6            completed 2013-10-02
6   7          3         12        6            completed 2013-10-02
7   8          2         12       12            completed 2013-10-03
8   9          3         10       12            completed 2013-10-03
9  10          4         13       12  cancelled_by_driver 2013-10-03


In [43]:
import pandas as pd

# Data for the Users table
users_data = {
    'users_id': [1, 2, 3, 4, 10, 11, 12, 13],
    'banned': ['No', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No'],
    'role': ['client', 'client', 'client', 'client', 'driver', 'driver', 'driver', 'driver']
}

# Create the Users DataFrame
users = pd.DataFrame(users_data)

print("\nUsers DataFrame:")
print(users)


Users DataFrame:
   users_id banned    role
0         1     No  client
1         2    Yes  client
2         3     No  client
3         4     No  client
4        10     No  driver
5        11     No  driver
6        12     No  driver
7        13     No  driver


In [44]:
trips = trips[trips.request_at.between('2013-10-01', '2013-10-03')].rename(columns={'request_at' : 'Day'})
users = users[users.banned == 'No']

In [46]:
trips['cancelled'] = trips.status.apply(lambda x: 0 if 'completed' in x else 1)

In [47]:
trips

Unnamed: 0,id,client_id,driver_id,city_id,status,Day,cancelled
0,1,1,10,1,completed,2013-10-01,0
1,2,2,11,1,cancelled_by_driver,2013-10-01,1
2,3,3,12,6,completed,2013-10-01,0
3,4,4,13,6,cancelled_by_client,2013-10-01,1
4,5,1,10,1,completed,2013-10-02,0
5,6,2,11,6,completed,2013-10-02,0
6,7,3,12,6,completed,2013-10-02,0
7,8,2,12,12,completed,2013-10-03,0
8,9,3,10,12,completed,2013-10-03,0
9,10,4,13,12,cancelled_by_driver,2013-10-03,1


In [51]:
# Keep only trips where both the client and the driver are unbanned
valid = trips.client_id.isin(users.users_id) & trips.driver_id.isin(users.users_id)
summary = trips[valid].groupby('Day')['cancelled'].mean().round(2).reset_index()

summary.rename(columns={'cancelled' : 'cancellation_rate'}, inplace=True)

summary

         Day  cancellation_rate
0 2013-10-01               0.33
1 2013-10-02               0.00
2 2013-10-03               0.50
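
The date filter, banned-user filter, and daily mean used above can be collected into one function. This is a minimal sketch that assumes the same column names as the cells above; the helper name `cancellation_rate` and the small `trips_demo` / `users_demo` frames are illustrative and not part of the original notebook.

```python
import pandas as pd

def cancellation_rate(trips, users, start, end):
    """Daily share of cancelled trips among unbanned clients and drivers."""
    # Hypothetical helper: the name and signature are illustrative only.
    allowed = users.loc[users['banned'] == 'No', 'users_id']
    in_range = trips['request_at'].between(start, end)
    unbanned = trips['client_id'].isin(allowed) & trips['driver_id'].isin(allowed)
    kept = trips[in_range & unbanned]
    return (
        kept.assign(cancelled=kept['status'].ne('completed').astype(int))
        .groupby('request_at')['cancelled']
        .mean()
        .round(2)
        .rename('cancellation_rate')
        .rename_axis('Day')
        .reset_index()
    )

# Illustrative data: the four trips requested on 2013-10-01
trips_demo = pd.DataFrame({
    'client_id': [1, 2, 3, 4],
    'driver_id': [10, 11, 12, 13],
    'status': ['completed', 'cancelled_by_driver', 'completed', 'cancelled_by_client'],
    'request_at': ['2013-10-01'] * 4,
})
users_demo = pd.DataFrame({
    'users_id': [1, 2, 3, 4, 10, 11, 12, 13],
    'banned': ['No', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No'],
})

rates = cancellation_rate(trips_demo, users_demo, '2013-10-01', '2013-10-03')
print(rates)
```

The trip made by banned client 2 is dropped before the mean is taken, so one of the three remaining trips counts as cancelled.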


In [52]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'player_id': [1, 1, 2, 3, 3],
    'device_id': [2, 2, 3, 1, 4],
    'event_date': ['2016-03-01', '2016-03-02', '2017-06-25', '2016-03-02', '2018-07-03'],
    'games_played': [5, 6, 1, 0, 5]
}

# Create the DataFrame
activity = pd.DataFrame(data)

# Convert the date column to a proper datetime format
activity['event_date'] = pd.to_datetime(activity['event_date'])

# Display the DataFrame
print(activity)

   player_id  device_id event_date  games_played
0          1          2 2016-03-01             5
1          1          2 2016-03-02             6
2          2          3 2017-06-25             1
3          3          1 2016-03-02             0
4          3          4 2018-07-03             5



In [None]:
import pandas as pd
import numpy as np

# Data from the table stored in a dictionary
# We use np.nan to represent the null value
data = {
    'id': [101, 102, 103, 104, 105, 106],
    'name': ['John', 'Dan', 'James', 'Amy', 'Anne', 'Ron'],
    'department': ['A', 'A', 'A', 'A', 'A', 'B'],
    'managerId': [np.nan, 101, 101, 101, 101, 101]
}

# Create the DataFrame

employees = pd.DataFrame(data)

# Display the DataFrame
print(employees)

    id   name department  managerId
0  101   John          A        NaN
1  102    Dan          A      101.0
2  103  James          A      101.0
3  104    Amy          A      101.0
4  105   Anne          A      101.0
5  106    Ron          B      101.0


In [None]:
count = employees['managerId'].value_counts()
count = count[count >= 5]
count



managerId
101.0    5
Name: count, dtype: int64

In [144]:
managers = (
    employees.groupby('managerId', as_index=False)
    .agg(direct_num = ('id', 'count'))
    .query('direct_num >= 5')
    ['managerId']
)
managers


0    101.0
Name: managerId, dtype: float64

In [145]:
result = employees[employees['id'].isin(managers)][['name']]
result

   name
0  John


In [127]:
result = (
    # Inner join: a left join would also keep employees with no reports and count each as 1
    employees.merge(employees, left_on='id', right_on='managerId', how='inner', suffixes=('_mgr', '_emp'))
    .groupby(['id_mgr', 'name_mgr'])
    .size()
    .reset_index(name='num_employees')
    .query('num_employees >= 5')[['name_mgr']]
)

result.columns = ['name']

result

   name
0  John


In [162]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'pid': [1, 2, 3, 4],
    'tiv_2015': [10, 20, 10, 10],
    'tiv_2016': [5, 20, 30, 40],
    'lat': [10, 20, 20, 40],
    'lon': [10, 20, 20, 40]
}

# Create the DataFrame
insurance = pd.DataFrame(data)

print("Insurance DataFrame:")
print(insurance)

Insurance DataFrame:
   pid  tiv_2015  tiv_2016  lat  lon
0    1        10         5   10   10
1    2        20        20   20   20
2    3        10        30   20   20
3    4        10        40   40   40


In [163]:
duplicated_tiv2015 = insurance[insurance.duplicated(subset='tiv_2015', keep=False)].pid
duplicated_tiv2015

0    1
2    3
3    4
Name: pid, dtype: int64

In [164]:
unique_latlon = insurance.drop_duplicates(subset=['lat', 'lon'], keep=False).pid
unique_latlon

0    1
3    4
Name: pid, dtype: int64

In [170]:
filtered = insurance[(insurance.pid.isin(duplicated_tiv2015)) & (insurance.pid.isin(unique_latlon))]
filtered

   pid  tiv_2015  tiv_2016  lat  lon
0    1        10         5   10   10
3    4        10        40   40   40


In [176]:
result = pd.DataFrame({'tiv_2016' : [filtered['tiv_2016'].sum().round(2)]})
result

   tiv_2016
0        45
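
The same two conditions can also be written as boolean masks built from group sizes, which avoids the intermediate `pid` lists. This is a sketch of an alternative, not the approach used in the cells above; the names `shared_tiv`, `unique_loc`, and `total_tiv_2016` are illustrative.

```python
import pandas as pd

insurance = pd.DataFrame({
    'pid': [1, 2, 3, 4],
    'tiv_2015': [10, 20, 10, 10],
    'tiv_2016': [5, 20, 30, 40],
    'lat': [10, 20, 20, 40],
    'lon': [10, 20, 20, 40],
})

# transform('count') broadcasts each group's size back to its rows,
# so each comparison yields one boolean per policy.
shared_tiv = insurance.groupby('tiv_2015')['pid'].transform('count') > 1
unique_loc = insurance.groupby(['lat', 'lon'])['pid'].transform('count') == 1

total_tiv_2016 = insurance.loc[shared_tiv & unique_loc, 'tiv_2016'].sum()
print(total_tiv_2016)
```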


In [177]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'order_number': [1, 2, 3, 4],
    'customer_number': [1, 2, 3, 3]
}

# Create the DataFrame
orders_df = pd.DataFrame(data)

print("Orders DataFrame:")
print(orders_df)

Orders DataFrame:
   order_number  customer_number
0             1                1
1             2                2
2             3                3
3             4                3


In [184]:
summary = (
    orders_df.customer_number.value_counts()
    .reset_index(name='order_count')
    .sort_values(by='order_count', ascending=False)
)
summary[['customer_number']].head(1)

   customer_number
0                3


In [185]:
import pandas as pd

# Data for the table
data = {
    'student': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'class': ['Math', 'English', 'Math', 'Biology', 'Math', 'Computer', 'Math', 'Math', 'Math']
}

# Create the DataFrame
courses_df = pd.DataFrame(data)

print("Courses DataFrame:")
print(courses_df)

Courses DataFrame:
  student     class
0       A      Math
1       B   English
2       C      Math
3       D   Biology
4       E      Math
5       F  Computer
6       G      Math
7       H      Math
8       I      Math


In [191]:
big_classes = (
    courses_df.groupby('class', as_index=False)
    .size()
    .query('size >= 5')
)
big_classes

  class  size
3  Math     6


In [218]:
import pandas as pd

# Data from the table stored in a dictionary
data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'visit_date': ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', 
                   '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-09'],
    'people': [10, 109, 150, 99, 145, 1455, 199, 188]
}

# Create the DataFrame
stadium = pd.DataFrame(data)

# Convert the date column to a proper datetime format
stadium['visit_date'] = pd.to_datetime(stadium['visit_date'])

# Display the DataFrame
print(stadium)

   id visit_date  people
0   1 2017-01-01      10
1   2 2017-01-02     109
2   3 2017-01-03     150
3   4 2017-01-04      99
4   5 2017-01-05     145
5   6 2017-01-06    1455
6   7 2017-01-07     199
7   8 2017-01-09     188


In [None]:
high_traffic = stadium[stadium.people >= 100].sort_values('id').reset_index(drop=True)
high_traffic


   id visit_date  people
0   2 2017-01-02     109
1   3 2017-01-03     150
2   5 2017-01-05     145
3   6 2017-01-06    1455
4   7 2017-01-07     199
5   8 2017-01-09     188


In [229]:
result = (
    high_traffic.loc[
        high_traffic.groupby(high_traffic.id - high_traffic.index)['id']
        .transform('count') >= 3]
    .sort_values('visit_date')
)

result

   id visit_date  people
2   5 2017-01-05     145
3   6 2017-01-06    1455
4   7 2017-01-07     199
5   8 2017-01-09     188
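
The grouping key `id - index` used above stays constant within a run of consecutive ids, because the id and the row position both grow by one from row to row, and it changes whenever an id is skipped. A small sketch on a bare Series of the same ids (the names `key`, `run_length`, and `islands` are illustrative) makes the key visible:

```python
import pandas as pd

ids = pd.Series([2, 3, 5, 6, 7, 8], name='id')

# The index is the row position 0..n-1; id minus position is constant
# while ids are consecutive and jumps when there is a gap.
key = ids - ids.index
run_length = ids.groupby(key).transform('count')

islands = pd.DataFrame({'id': ids, 'key': key, 'run_length': run_length})
print(islands)

# Keep only ids that belong to a run of at least three consecutive values.
long_runs = ids[run_length >= 3].tolist()
print(long_runs)
```

Ids 2 and 3 share one key and form a run of two, which is dropped; ids 5 through 8 share the next key and form a run of four, which is kept.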


In [230]:
import pandas as pd

# Data for the table
data = {
    'requester_id': [1, 1, 2, 3],
    'accepter_id': [2, 3, 3, 4],
    'accept_date': ['2016/06/03', '2016/06/08', '2016/06/08', '2016/06/09']
}

# Create the DataFrame
requests_accepted = pd.DataFrame(data)

# Convert the date column to a proper datetime format
requests_accepted['accept_date'] = pd.to_datetime(requests_accepted['accept_date'])

print("RequestsAccepted DataFrame:")
print(requests_accepted)

RequestsAccepted DataFrame:
   requester_id  accepter_id accept_date
0             1            2  2016-06-03
1             1            3  2016-06-08
2             2            3  2016-06-08
3             3            4  2016-06-09


In [235]:
requester = requests_accepted['requester_id'].value_counts().reset_index()
requester

   requester_id  count
0             1      2
1             2      1
2             3      1


In [234]:
accepter = requests_accepted['accepter_id'].value_counts().reset_index()
accepter

   accepter_id  count
0            3      2
1            2      1
2            4      1


In [240]:
summary = pd.DataFrame({
    'id' : requester['requester_id'].tolist() + accepter['accepter_id'].tolist(),
    'num' : requester['count'].tolist() + accepter['count'].tolist()
})

summary

   id  num
0   1    2
1   2    1
2   3    1
3   3    2
4   2    1
5   4    1


In [243]:
summary = summary.groupby('id', as_index=False)['num'].sum().sort_values('num', ascending=False)
summary.head(1)

   id  num
2   3    3
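
An alternative to building the combined frame from Python lists is to stack both id columns and count once. This is a sketch of that variant rather than the method used in the cells above; `requests` and `friend_counts` are illustrative names, and the inline data repeats the id columns of the table above.

```python
import pandas as pd

requests = pd.DataFrame({
    'requester_id': [1, 1, 2, 3],
    'accepter_id': [2, 3, 3, 4],
})

# Each accepted request adds one friend to both sides, so stacking the
# two id columns and counting occurrences gives friends per user.
friend_counts = (
    pd.concat([requests['requester_id'], requests['accepter_id']], ignore_index=True)
    .value_counts()
    .rename_axis('id')
    .reset_index(name='num')
)

top = friend_counts.head(1)
print(top)
```

`value_counts` already sorts by count in descending order, so `head(1)` returns the user with the most friends without a separate sort.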


In [244]:
import pandas as pd

# Data for the SalesPerson table
sales_person_data = {
    'sales_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Amy', 'Mark', 'Pam', 'Alex'],
    'salary': [100000, 12000, 65000, 25000, 5000],
    'commission_rate': [6, 5, 12, 25, 10],
    'hire_date': ['4/1/2006', '5/1/2010', '12/25/2008', '1/1/2005', '2/3/2007']
}

# Create the SalesPerson DataFrame
sales_person = pd.DataFrame(sales_person_data)

# Convert hire_date to proper datetime format
sales_person['hire_date'] = pd.to_datetime(sales_person['hire_date'])

print("SalesPerson DataFrame:")
print(sales_person)

SalesPerson DataFrame:
   sales_id  name  salary  commission_rate  hire_date
0         1  John  100000                6 2006-04-01
1         2   Amy   12000                5 2010-05-01
2         3  Mark   65000               12 2008-12-25
3         4   Pam   25000               25 2005-01-01
4         5  Alex    5000               10 2007-02-03


In [245]:
import pandas as pd

# Data for the Company table
company_data = {
    'com_id': [1, 2, 3, 4],
    'name': ['RED', 'ORANGE', 'YELLOW', 'GREEN'],
    'city': ['Boston', 'New York', 'Boston', 'Austin']
}

# Create the Company DataFrame
company = pd.DataFrame(company_data)

print("\nCompany DataFrame:")
print(company)


Company DataFrame:
   com_id    name      city
0       1     RED    Boston
1       2  ORANGE  New York
2       3  YELLOW    Boston
3       4   GREEN    Austin


In [246]:
import pandas as pd

# Data for the Orders table
orders_data = {
    'order_id': [1, 2, 3, 4],
    'order_date': ['1/1/2014', '2/1/2014', '3/1/2014', '4/1/2014'],
    'com_id': [3, 4, 1, 1],
    'sales_id': [4, 5, 1, 4],
    'amount': [10000, 5000, 50000, 25000]
}

# Create the Orders DataFrame
orders = pd.DataFrame(orders_data)

# Convert order_date to proper datetime format
orders['order_date'] = pd.to_datetime(orders['order_date'])

print("\nOrders DataFrame:")
print(orders)


Orders DataFrame:
   order_id order_date  com_id  sales_id  amount
0         1 2014-01-01       3         4   10000
1         2 2014-02-01       4         5    5000
2         3 2014-03-01       1         1   50000
3         4 2014-04-01       1         4   25000


In [248]:
merged = company.merge(orders, on='com_id')
merged

   com_id    name    city  order_id order_date  sales_id  amount
0       1     RED  Boston         3 2014-03-01         1   50000
1       1     RED  Boston         4 2014-04-01         4   25000
2       3  YELLOW  Boston         1 2014-01-01         4   10000
3       4   GREEN  Austin         2 2014-02-01         5    5000


In [249]:
# Select rows and column in a single .loc call instead of chained indexing
red_related = merged.loc[merged['name'] == 'RED', 'sales_id']
red_related

0    1
1    4
Name: sales_id, dtype: int64

In [250]:
result = sales_person[~sales_person['sales_id'].isin(red_related)][['name']]
result

   name
1   Amy
2  Mark
4  Alex
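
The exclusion step above can also be phrased as an anti-join using `merge` with `indicator=True`. The sketch below is one possible variant under the same table layout; the frames are trimmed to the columns the join needs, and `red_sellers`, `flagged`, and `no_red` are illustrative names.

```python
import pandas as pd

sales_person = pd.DataFrame({
    'sales_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Amy', 'Mark', 'Pam', 'Alex'],
})
company = pd.DataFrame({
    'com_id': [1, 2, 3, 4],
    'name': ['RED', 'ORANGE', 'YELLOW', 'GREEN'],
})
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'com_id': [3, 4, 1, 1],
    'sales_id': [4, 5, 1, 4],
})

# Salespeople with at least one order for the company named RED
red_ids = company.loc[company['name'] == 'RED', 'com_id']
red_sellers = orders.loc[orders['com_id'].isin(red_ids), ['sales_id']].drop_duplicates()

# Left join with an indicator column; 'left_only' marks salespeople
# that found no match among the RED sellers.
flagged = sales_person.merge(red_sellers, on='sales_id', how='left', indicator=True)
no_red = flagged.loc[flagged['_merge'] == 'left_only', ['name']]
print(no_red)
```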
