### Question 1: IPO Filings Web Scraping and Data Processing

**What's the total sum ($m) of 2023 filings that happened on Fridays?**

Re-use the [Code Snippet 1] example to get the data from the web for this endpoint: https://stockanalysis.com/ipos/filings/
Convert 'Filing Date' to datetime and 'Shares Offered' to float64 (where '-' is encountered, populate with NaN).
Define a new field 'Avg_price' based on 'Price Range': NaN if no price is specified, the price itself if only one number is provided, or the average of the two prices if a range is given.
You may be inspired by the function `extract_numbers()` in [Code Snippet 4], or you can write your own function to parse the string.
Define a column 'Shares_offered_value' equal to 'Shares Offered' * 'Avg_price' (when both columns are defined; otherwise it is NaN).

Find the total sum in $m (millions of USD, rounded to the closest integer) for all filings during 2023 that happened on Fridays (`Date.dt.dayofweek == 4`; note that `dayofweek` is a property, not a method). You should see 32 records in total, 24 of which are not null.
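
The filter described above can be sketched on a handful of rows copied from the filings table. The DataFrame `sample` and the variable names are illustrative assumptions, and the sample is not the full dataset, so the total printed here is not the answer to the question.

```python
import numpy as np
import pandas as pd

# Five rows copied from the filings table; illustrative only, not the full dataset
sample = pd.DataFrame({
    "Filing Date": pd.to_datetime(
        ["2023-12-29", "2023-12-28", "2023-12-22", "2023-01-13", "2024-04-26"]
    ),
    "Shares_offered_value": [4_800_000.0, 8_000_002.5, np.nan, 16_875_000.0, 50_000_000.0],
})

# dayofweek is a property (no parentheses): Monday=0 ... Friday=4
is_2023 = sample["Filing Date"].dt.year == 2023
is_friday = sample["Filing Date"].dt.dayofweek == 4
fridays_2023 = sample[is_2023 & is_friday]

n_records = len(fridays_2023)                                     # all 2023 Friday filings
n_not_null = fridays_2023["Shares_offered_value"].notna().sum()   # those with a value
total_musd = round(fridays_2023["Shares_offered_value"].sum() / 1e6)
print(n_records, n_not_null, total_musd)
```

On the real data the same three numbers should correspond to the record count, the non-null count and the total in $m mentioned in the task.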

(additional: you can read about [S-1 IPO filing](https://www.dfinsolutions.com/knowledge-hub/thought-leadership/knowledge-resources/what-s-1-ipo-filing) to understand the context)

In [91]:
!pip install pandas



In [92]:
import pandas as pd
import requests

import numpy as np

#Fin Data Sources
import yfinance as yf
import pandas_datareader as pdr

#Data viz
import plotly.graph_objs as go
import plotly.express as px

import time
from datetime import date, datetime


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
}

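from io import StringIO

url = "https://stockanalysis.com/ipos/filings/"
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early on an HTTP error

# Wrap the HTML in StringIO: passing a literal HTML string to read_html is deprecated in recent pandas versions
ipo_dfs = pd.read_html(StringIO(response.text))
ipo_dfs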

[      Filing Date Symbol                           Company Name  \
 0    Apr 26, 2024   EURK                Eureka Acquisition Corp   
 1    Apr 26, 2024    HDL    Super Hi International Holding Ltd.   
 2    Apr 22, 2024   DRJT                        Derun Group Inc   
 3    Apr 19, 2024   GPAT           GP-Act III Acquisition Corp.   
 4    Apr 16, 2024   JLJT                  Jialiang Holdings Ltd   
 ..            ...    ...                                    ...   
 324  Jan 21, 2020   GOXS                            Goxus, Inc.   
 325  Jan 21, 2020   UTXO                 UTXO Acquisition, Inc.   
 326   Dec 9, 2019   LOHA                           Loha Co. Ltd   
 327   Oct 4, 2019   ZGHB  China Eco-Materials Group Co. Limited   
 328  Dec 27, 2018   FBOX              Fit Boxx Holdings Limited   
 
         Price Range Shares Offered  
 0            $10.00        5000000  
 1                 -              -  
 2             $5.00              -  
 3            $10.00       250

In [93]:
# check the data types of the columns
ipo_dfs[0].info()
# keep the first table returned by read_html as the working DataFrame
ipos_dfs = ipo_dfs[0]
ipos_dfs

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 329 entries, 0 to 328
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Filing Date     329 non-null    object
 1   Symbol          329 non-null    object
 2   Company Name    329 non-null    object
 3   Price Range     329 non-null    object
 4   Shares Offered  329 non-null    object
dtypes: object(5)
memory usage: 13.0+ KB


Unnamed: 0,Filing Date,Symbol,Company Name,Price Range,Shares Offered
0,"Apr 26, 2024",EURK,Eureka Acquisition Corp,$10.00,5000000
1,"Apr 26, 2024",HDL,Super Hi International Holding Ltd.,-,-
2,"Apr 22, 2024",DRJT,Derun Group Inc,$5.00,-
3,"Apr 19, 2024",GPAT,GP-Act III Acquisition Corp.,$10.00,25000000
4,"Apr 16, 2024",JLJT,Jialiang Holdings Ltd,$5.00,-
...,...,...,...,...,...
324,"Jan 21, 2020",GOXS,"Goxus, Inc.",$8.00 - $10.00,1500000
325,"Jan 21, 2020",UTXO,"UTXO Acquisition, Inc.",$10.00,5000000
326,"Dec 9, 2019",LOHA,Loha Co. Ltd,$8.00 - $10.00,2500000
327,"Oct 4, 2019",ZGHB,China Eco-Materials Group Co. Limited,$4.00,4300000


In [94]:
# convert the 'Filing Date' column from string to datetime
ipos_dfs['Filing Date'] = pd.to_datetime(ipos_dfs['Filing Date'])
ipos_dfs

Unnamed: 0,Filing Date,Symbol,Company Name,Price Range,Shares Offered
0,2024-04-26,EURK,Eureka Acquisition Corp,$10.00,5000000
1,2024-04-26,HDL,Super Hi International Holding Ltd.,-,-
2,2024-04-22,DRJT,Derun Group Inc,$5.00,-
3,2024-04-19,GPAT,GP-Act III Acquisition Corp.,$10.00,25000000
4,2024-04-16,JLJT,Jialiang Holdings Ltd,$5.00,-
...,...,...,...,...,...
324,2020-01-21,GOXS,"Goxus, Inc.",$8.00 - $10.00,1500000
325,2020-01-21,UTXO,"UTXO Acquisition, Inc.",$10.00,5000000
326,2019-12-09,LOHA,Loha Co. Ltd,$8.00 - $10.00,2500000
327,2019-10-04,ZGHB,China Eco-Materials Group Co. Limited,$4.00,4300000


In [95]:
ipos_dfs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 329 entries, 0 to 328
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Filing Date     329 non-null    datetime64[ns]
 1   Symbol          329 non-null    object        
 2   Company Name    329 non-null    object        
 3   Price Range     329 non-null    object        
 4   Shares Offered  329 non-null    object        
dtypes: datetime64[ns](1), object(4)
memory usage: 13.0+ KB


In [96]:
# '-' marks a missing value; errors='coerce' turns every non-numeric entry into NaN
ipos_dfs['Shares Offered'] = pd.to_numeric(ipos_dfs['Shares Offered'].str.replace('-', ' '), errors='coerce')
ipos_dfs

Unnamed: 0,Filing Date,Symbol,Company Name,Price Range,Shares Offered
0,2024-04-26,EURK,Eureka Acquisition Corp,$10.00,5000000.0
1,2024-04-26,HDL,Super Hi International Holding Ltd.,-,
2,2024-04-22,DRJT,Derun Group Inc,$5.00,
3,2024-04-19,GPAT,GP-Act III Acquisition Corp.,$10.00,25000000.0
4,2024-04-16,JLJT,Jialiang Holdings Ltd,$5.00,
...,...,...,...,...,...
324,2020-01-21,GOXS,"Goxus, Inc.",$8.00 - $10.00,1500000.0
325,2020-01-21,UTXO,"UTXO Acquisition, Inc.",$10.00,5000000.0
326,2019-12-09,LOHA,Loha Co. Ltd,$8.00 - $10.00,2500000.0
327,2019-10-04,ZGHB,China Eco-Materials Group Co. Limited,$4.00,4300000.0
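
As a side note, `errors='coerce'` on its own appears to be sufficient for this conversion, because a lone '-' cannot be parsed as a number. A minimal sketch on a toy Series (the name `shares_raw` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the scraped 'Shares Offered' column (values arrive as strings)
shares_raw = pd.Series(["5000000", "-", "25000000"])

# errors='coerce' turns anything that is not a number, including '-', into NaN
shares = pd.to_numeric(shares_raw, errors="coerce")
print(shares.dtype, shares.isna().sum())
```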


In [97]:
ipos_dfs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 329 entries, 0 to 328
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Filing Date     329 non-null    datetime64[ns]
 1   Symbol          329 non-null    object        
 2   Company Name    329 non-null    object        
 3   Price Range     329 non-null    object        
 4   Shares Offered  253 non-null    float64       
dtypes: datetime64[ns](1), float64(1), object(3)
memory usage: 13.0+ KB


In [98]:
ipos_dfs.isnull().sum()  # number of null values per column

Filing Date        0
Symbol             0
Company Name       0
Price Range        0
Shares Offered    76
dtype: int64

In [99]:
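import re

def extract_numbers(input_string):
    # Split a range such as "$8.00 - $10.00" into its lower and upper parts
    split_string = input_string.split(" - ")
    # Escape the dot and allow several decimals so that e.g. "$4.25" parses as 4.25
    y_match = re.search(r'(\d+\.\d+)', split_string[0])
    if len(split_string) > 1:
        m_match = re.search(r'(\d+\.\d+)', split_string[1])
        y1_number = float(y_match.group(1)) if y_match else 0
        m1_number = float(m_match.group(1)) if m_match else 0
        # Average of the two ends of the range
        return (y1_number + m1_number) / len(split_string)
    else:
        # Single price; note this returns 0 (not NaN) when no price is given, e.g. for "-"
        y0 = float(y_match.group(1)) if y_match else 0
        return y0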
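
The task statement asks for NaN when no price is specified, whereas the function above falls back to 0. One possible variant that returns NaN is sketched below; the name `parse_avg_price` is hypothetical, and the sketch assumes prices are written without thousands separators.

```python
import math
import re

def parse_avg_price(price_range):
    """Return the average of all prices found in the string, or NaN if none."""
    # Find every number with an optional decimal part, e.g. "8.00" and "10.00"
    prices = [float(p) for p in re.findall(r"\d+(?:\.\d+)?", price_range)]
    if not prices:
        return math.nan  # no price specified, e.g. "-"
    return sum(prices) / len(prices)

print(parse_avg_price("$8.00 - $10.00"))  # 9.0
print(parse_avg_price("$10.00"))          # 10.0
print(parse_avg_price("-"))               # nan
```

With this variant, a row that has shares but no price would get a NaN 'Shares_offered_value' instead of 0, which changes the non-null counts but not the sum.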


In [100]:
ipos_dfs['Avg_price'] = ipos_dfs['Price Range'].apply(extract_numbers)
display(ipos_dfs)

Unnamed: 0,Filing Date,Symbol,Company Name,Price Range,Shares Offered,Avg_price
0,2024-04-26,EURK,Eureka Acquisition Corp,$10.00,5000000.0,10.00
1,2024-04-26,HDL,Super Hi International Holding Ltd.,-,,0.00
2,2024-04-22,DRJT,Derun Group Inc,$5.00,,5.00
3,2024-04-19,GPAT,GP-Act III Acquisition Corp.,$10.00,25000000.0,10.00
4,2024-04-16,JLJT,Jialiang Holdings Ltd,$5.00,,5.00
...,...,...,...,...,...,...
324,2020-01-21,GOXS,"Goxus, Inc.",$8.00 - $10.00,1500000.0,9.00
325,2020-01-21,UTXO,"UTXO Acquisition, Inc.",$10.00,5000000.0,10.00
326,2019-12-09,LOHA,Loha Co. Ltd,$8.00 - $10.00,2500000.0,9.00
327,2019-10-04,ZGHB,China Eco-Materials Group Co. Limited,$4.00,4300000.0,4.00


In [126]:
ipos_dfs['Shares_offered_value'] = ipos_dfs['Shares Offered'] * ipos_dfs['Avg_price']
ipos_dfs

Unnamed: 0,Filing Date,Symbol,Company Name,Price Range,Shares Offered,Avg_price,Shares_offered_value
0,2024-04-26,EURK,Eureka Acquisition Corp,$10.00,5000000.0,10.00,50000000.0
1,2024-04-26,HDL,Super Hi International Holding Ltd.,-,,0.00,
2,2024-04-22,DRJT,Derun Group Inc,$5.00,,5.00,
3,2024-04-19,GPAT,GP-Act III Acquisition Corp.,$10.00,25000000.0,10.00,250000000.0
4,2024-04-16,JLJT,Jialiang Holdings Ltd,$5.00,,5.00,
...,...,...,...,...,...,...,...
324,2020-01-21,GOXS,"Goxus, Inc.",$8.00 - $10.00,1500000.0,9.00,13500000.0
325,2020-01-21,UTXO,"UTXO Acquisition, Inc.",$10.00,5000000.0,10.00,50000000.0
326,2019-12-09,LOHA,Loha Co. Ltd,$8.00 - $10.00,2500000.0,9.00,22500000.0
327,2019-10-04,ZGHB,China Eco-Materials Group Co. Limited,$4.00,4300000.0,4.00,17200000.0


In [161]:
ipos_dfs.info()
#ipos_dfs.set_index('Filing Date')
df = ipos_dfs

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 329 entries, 0 to 328
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Filing Date           329 non-null    datetime64[ns]
 1   Symbol                329 non-null    object        
 2   Company Name          329 non-null    object        
 3   Price Range           329 non-null    object        
 4   Shares Offered        253 non-null    float64       
 5   Avg_price             329 non-null    float64       
 6   Shares_offered_value  253 non-null    float64       
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 18.1+ KB


In [162]:
# extract the subset of records filed in 2023

df1 = df.loc[df['Filing Date'] >= '2023-01-01']
df2 = df1.loc[df1['Filing Date'] <= '2023-12-31']
df2 = df2.reset_index(drop=True)
df2

Unnamed: 0,Filing Date,Symbol,Company Name,Price Range,Shares Offered,Avg_price,Shares_offered_value
0,2023-12-29,LEC,Lafayette Energy Corp,$3.50 - $4.50,1200000.0,4.0,4800000.0
1,2023-12-29,EPSM,Epsium Enterprise Limited,-,,0.0,
2,2023-12-28,ONDR,"Sushi Ginza Onodera, Inc.",$7.00 - $8.00,1066667.0,7.5,8000002.5
3,2023-12-27,JDZG,Jiade Limited,$4.00 - $5.00,2200000.0,4.5,9900000.0
4,2023-12-22,CHLW,Chun Hui Le Wan International Holding Group Ltd,-,,0.0,
...,...,...,...,...,...,...,...
114,2023-01-31,FBGL,FBS Global Limited,$4.00 - $5.00,1875000.0,4.5,8437500.0
115,2023-01-24,THNK,"T1V, Inc.",$4.00 - $6.00,3300000.0,5.0,16500000.0
116,2023-01-23,RPET,New Ruipeng Pet Group Inc.,-,,0.0,
117,2023-01-13,RVGO,"RVeloCITY, Inc.",$4.00 - $5.00,3750000.0,4.5,16875000.0
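
The closed upper bound `<= '2023-12-31'` works here because the filing dates carry no time component. If they did, a half-open interval would be the safer choice; a small sketch with made-up timestamps (the Series `dates` is an assumption, not scraped data):

```python
import pandas as pd

# Toy timestamps; the third one carries a time-of-day component
dates = pd.Series([
    pd.Timestamp("2022-12-31"),
    pd.Timestamp("2023-01-01"),
    pd.Timestamp("2023-12-31 15:30"),
    pd.Timestamp("2024-01-01"),
])

start = pd.Timestamp("2023-01-01")
# A closed upper bound at midnight of Dec 31 misses later times on that day
closed = (dates >= start) & (dates <= pd.Timestamp("2023-12-31"))
# The half-open interval [2023-01-01, 2024-01-01) covers the whole year
half_open = (dates >= start) & (dates < pd.Timestamp("2024-01-01"))

print(closed.tolist())     # [False, True, False, False]
print(half_open.tolist())  # [False, True, True, False]
```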


In [165]:
def numofday(dt):
  num = dt.weekday()
  return num

In [167]:
# get the day of the week as a number (Monday=0, ..., Friday=4)
df2['Day of week'] = df2['Filing Date'].apply(lambda x: numofday(x))
df2

Unnamed: 0,Filing Date,Symbol,Company Name,Price Range,Shares Offered,Avg_price,Shares_offered_value,Day of week
0,2023-12-29,LEC,Lafayette Energy Corp,$3.50 - $4.50,1200000.0,4.0,4800000.0,4
1,2023-12-29,EPSM,Epsium Enterprise Limited,-,,0.0,,4
2,2023-12-28,ONDR,"Sushi Ginza Onodera, Inc.",$7.00 - $8.00,1066667.0,7.5,8000002.5,3
3,2023-12-27,JDZG,Jiade Limited,$4.00 - $5.00,2200000.0,4.5,9900000.0,2
4,2023-12-22,CHLW,Chun Hui Le Wan International Holding Group Ltd,-,,0.0,,4
...,...,...,...,...,...,...,...,...
114,2023-01-31,FBGL,FBS Global Limited,$4.00 - $5.00,1875000.0,4.5,8437500.0,1
115,2023-01-24,THNK,"T1V, Inc.",$4.00 - $6.00,3300000.0,5.0,16500000.0,1
116,2023-01-23,RPET,New Ruipeng Pet Group Inc.,-,,0.0,,0
117,2023-01-13,RVGO,"RVeloCITY, Inc.",$4.00 - $5.00,3750000.0,4.5,16875000.0,4
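
A vectorized alternative to the `apply` call is the `.dt` accessor, and `.dt.day_name()` gives a quick sanity check of the numbering. A minimal sketch on three dates taken from the table above (the variable names are assumptions):

```python
import pandas as pd

# Three filing dates taken from the table above
dates = pd.Series(pd.to_datetime(["2023-12-29", "2023-12-28", "2023-01-23"]))

day_numbers = dates.dt.dayofweek   # vectorized equivalent of apply(weekday)
day_names = dates.dt.day_name()    # human-readable check of the numbering

print(day_numbers.tolist())  # [4, 3, 0]
print(day_names.tolist())    # ['Friday', 'Thursday', 'Monday']
```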


In [174]:
# keep only Friday filings (day of week == 4)
df_3 = df2.loc[df2['Day of week'] == 4]
# keep only rows with a non-null 'Shares_offered_value'
df_final = df_3[~df_3['Shares_offered_value'].isna()]
df_final

Unnamed: 0,Filing Date,Symbol,Company Name,Price Range,Shares Offered,Avg_price,Shares_offered_value,Day of week
0,2023-12-29,LEC,Lafayette Energy Corp,$3.50 - $4.50,1200000.0,4.0,4800000.0,4
12,2023-12-08,ENGS,Energys Group Limited,$4.00 - $6.00,2000000.0,5.0,10000000.0,4
13,2023-12-08,LNKS,Linkers Industries Limited,$4.00 - $6.00,2200000.0,5.0,11000000.0,4
34,2023-10-27,RAY,Raytech Holding Limited,$4.00 - $5.00,1500000.0,4.5,6750000.0,4
41,2023-10-13,ORIS,Oriental Rise Holdings Limited,$4.00,2000000.0,4.0,8000000.0,4
44,2023-10-06,QMMM,QMMM Holdings Limited,$4.00,2125000.0,4.0,8500000.0,4
48,2023-09-29,KAPA,"Kairos Pharma, Ltd.",$4.00,1550000.0,4.0,6200000.0,4
49,2023-09-29,VAPA,Valens Pay Global Limited,$5.00 - $6.00,1000000.0,5.5,5500000.0,4
56,2023-09-15,ACSB,Acesis Holdings Corporation,$4.00 - $6.00,1300000.0,5.0,6500000.0,4
73,2023-07-07,AZI,Autozi Internet Technology (Global) Ltd.,$4.00 - $5.00,1250000.0,4.5,5625000.0,4


In [185]:
total = df_final['Shares_offered_value'].sum()
f"total sum in millions: ${round(total/1000000)}M"


'total sum in millions: $276M'
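
Dropping the null rows before summing is helpful for counting records, but it should not change the total, since `Series.sum()` skips NaN by default. A small sketch with made-up values (the Series `values` is an assumption):

```python
import numpy as np
import pandas as pd

# Toy values: two filings with a value and one without
values = pd.Series([4_800_000.0, np.nan, 10_000_000.0])

total_with_nan = values.sum()            # NaN is skipped by default (skipna=True)
total_dropped = values.dropna().sum()    # same result after dropping NaN explicitly
print(round(total_with_nan / 1_000_000), round(total_dropped / 1_000_000))
```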