# [Project 1 - CMSC320](https://github.com/cmsc320/fall2022/tree/main/project1)

Posted: September 18, 2022; Due: October 1, 2022

Please read the entire document before starting!

You've been hired by a new space weather startup looking to disrupt the space weather reporting business. Your first project is to provide better data about the top 50 solar flares recorded so far than the data shown by your competitor, SpaceWeatherLive.com. To do this, they've pointed you to this messy HTML page from NASA (that link is just a copy of the real NASA page, since we don't want to DDoS NASA), where you can get the extra data your startup is going to post on its spiffy new site.

Of course, you don't have access to the raw data for either of these two tables, so as an enterprising data scientist you will scrape this information directly from each HTML page using all the great tools available to you in Python. By the way, you should read up a bit on Solar Flares, coronal mass ejections, the solar flare alphabet soup, the scary storms of Halloween 2003, and sickening solar flares.

## Part 1: Data scraping and preparation

### Step 1: Scrape your competitor's data (10 pts)

In [1]:
import requests, pandas as pd, numpy as np, datetime, re
from bs4 import BeautifulSoup

In [2]:
# pd.read_html parses every <table> on the page into a DataFrame, so bs4 is not needed here
dfs = pd.read_html(io="https://cmsc320.github.io/files/top-50-solar-flares.html")

In [3]:
# get the data frame and rename the columns
competitor_df = dfs[0]
competitor_df.columns = ["rank","x_classification", "date", "region", "start_time", "maximum_time","end_time", "drop"]

### Step 2: Tidy the top 50 solar flare data (10 pts)

In [4]:
# drop the last column
competitor_df = competitor_df.drop(columns="drop")


In [5]:
# combine the date with each of the start, max and end times
time_cols = ["start_time", "maximum_time", "end_time"]
for idx, row in competitor_df.iterrows():
    # the date is split on "/" (year first); each time is split on ":"
    date_f = datetime.date(*[int(i) for i in row["date"].split("/")])
    for c in time_cols:
        time_f = datetime.time(*[int(i) for i in row[c].split(":")])
        competitor_df.at[idx, c] = datetime.datetime.combine(date_f, time_f)
    # clean x_classification: strip the trailing "+"
    x = row["x_classification"]
    if "+" in x:
        competitor_df.at[idx, "x_classification"] = x[:-1]
    # clean region: a missing region is listed as "-"
    if row["region"] == "-":
        competitor_df.at[idx, "region"] = 0
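
The row-by-row loop above can also be written with vectorized pandas calls. The following is only a sketch on a hypothetical two-row sample (the `sample` frame is not part of the assignment; its values are copied from the first two rows of the tidy table), assuming the dates look like `YYYY/MM/DD` and the times like `HH:MM`:

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as the scraped table.
sample = pd.DataFrame({
    "date": ["2003/11/04", "2001/04/02"],
    "start_time": ["19:29", "21:32"],
    "maximum_time": ["19:53", "21:51"],
    "end_time": ["20:06", "22:03"],
})

for col in ["start_time", "maximum_time", "end_time"]:
    # Join "YYYY/MM/DD" and "HH:MM" into one string, then parse it
    # with an explicit format so malformed rows raise an error.
    sample[col] = pd.to_datetime(
        sample["date"] + " " + sample[col], format="%Y/%m/%d %H:%M"
    )

# The date is now part of every datetime column, so it can go.
sample = sample.drop(columns="date")
```

Passing an explicit `format` is a design choice here: it makes unexpected strings fail loudly rather than being guessed at.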

    
    

In [6]:
# we no longer need the date, so drop that column
competitor_df = competitor_df.drop(columns="date")

In [7]:
# rename the headers
competitor_df.columns = ["rank","x_classification", "region", "start_datetime", "max_datetime","end_datetime"]
competitor_df

| | rank | x_classification | region | start_datetime | max_datetime | end_datetime |
|---|---|---|---|---|---|---|
| 0 | 1 | X28 | 486 | 2003-11-04 19:29:00 | 2003-11-04 19:53:00 | 2003-11-04 20:06:00 |
| 1 | 2 | X20 | 9393 | 2001-04-02 21:32:00 | 2001-04-02 21:51:00 | 2001-04-02 22:03:00 |
| 2 | 3 | X17.2 | 486 | 2003-10-28 09:51:00 | 2003-10-28 11:10:00 | 2003-10-28 11:24:00 |
| 3 | 4 | X17 | 808 | 2005-09-07 17:17:00 | 2005-09-07 17:40:00 | 2005-09-07 18:03:00 |
| 4 | 5 | X14.4 | 9415 | 2001-04-15 13:19:00 | 2001-04-15 13:50:00 | 2001-04-15 13:55:00 |
| 5 | 6 | X10 | 486 | 2003-10-29 20:37:00 | 2003-10-29 20:49:00 | 2003-10-29 21:01:00 |
| 6 | 7 | X9.4 | 8100 | 1997-11-06 11:49:00 | 1997-11-06 11:55:00 | 1997-11-06 12:01:00 |
| 7 | 8 | X9.3 | 2673 | 2017-09-06 11:53:00 | 2017-09-06 12:02:00 | 2017-09-06 12:10:00 |
| 8 | 9 | X9 | 930 | 2006-12-05 10:18:00 | 2006-12-05 10:35:00 | 2006-12-05 10:45:00 |
| 9 | 10 | X8.3 | 486 | 2003-11-02 17:03:00 | 2003-11-02 17:25:00 | 2003-11-02 17:39:00 |


### Step 3: Scrape the NASA data (15 pts)

In [8]:
nasaPage = requests.get("https://cmsc320.github.io/files/waves_type2.html")
# save a local copy of the page; the with block closes the file on exit
with open("nasa.html", "wb") as f:
    f.write(nasaPage.content)
with open("nasa.html", "r") as f:
    nasaSoup = BeautifulSoup(f.read(), "html.parser")

In [9]:
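# unwrap every <a> tag so that only its text remains
for a_tag in nasaSoup("a"):
    a_tag.unwrap()
# the file holds the page's plain text (no markup), despite its .html name
with open("nasaDropA.html", "w") as f:
    f.write(nasaSoup.text)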

### Step 4: Tidy the NASA table (15 pts)

In [10]:
from math import nan

with open("nasaDropA.html", 'r') as f:
    lines = list(f)
headers = ["start_date", "start_time", "end_date", "end_time", "start_frequency", "end_frequency", "flare_location", "flare_region","flare_classification", "cme_date", "cme_time", "cme_angle", "cme_width","cme_speed"]
# process as a 2D array: one list of up to 14 fields per table row
matrix = []
# the table rows sit at list indices 15 through 532 of the extracted text
for i in range(15, 533):
    line = lines[i].split()[:14]
    # a separate index j avoids shadowing the outer loop variable i
    for j in range(len(line)):
        # any field containing "-" is treated as a missing-value placeholder
        if "-" in line[j]:
            line[j] = nan
    matrix.append(line)

0       79
1      360
2      360
3      165
4      155
      ... 
513    360
514    360
515    360
516     96
517    360
Name: cme_width, Length: 518, dtype: object
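
To make the row-splitting step easier to check in isolation, here is a small sketch that parses a single row with only the standard library. The helper `parse_row` and the sample line are illustrative assumptions (the line follows the whitespace-delimited layout the cell above relies on, with made-up placeholder fields), not part of the original notebook:

```python
import math

HEADERS = ["start_date", "start_time", "end_date", "end_time",
           "start_frequency", "end_frequency", "flare_location",
           "flare_region", "flare_classification", "cme_date",
           "cme_time", "cme_angle", "cme_width", "cme_speed"]

def parse_row(line):
    """Split one whitespace-delimited row, keep the first 14 fields,
    and map any field containing '-' to NaN (same rule as the loop above)."""
    fields = line.split()[:len(HEADERS)]
    return [math.nan if "-" in field else field for field in fields]

# Illustrative sample line; the dash-filled fields stand for missing CME data.
sample_line = ("1997/04/01 14:00 04/01 14:15 8000 4000 S25E16 8026 M1.3 "
               "--/-- --:-- ---- ---- ---- PHTX")
row = parse_row(sample_line)
record = dict(zip(HEADERS, row))
```

Keeping the rule in one small function makes it simple to try it on a few awkward rows before running it over the whole table.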

In [30]:
# moving into pandas dataframe
nasa_df = pd.DataFrame(matrix, columns=headers)
nasa_df["isHalo"] = nasa_df["cme_angle"] == "Halo"
# def f(s):
#     return ">" in s
# nasa_df["width_lower_bound"] = nasa_df.apply(lambda row: f(row.cme_width), axis = 1)
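
The halo flag and the commented-out lower-bound flag can both be derived with pandas string methods before the markers are stripped. This is a sketch on a hypothetical four-row frame (`cme` and its values are made up for the demo), assuming that a leading `>` in `cme_width` marks a lower bound:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same string encodings as the NASA columns.
cme = pd.DataFrame({
    "cme_angle": ["74", "Halo", "Halo", np.nan],
    "cme_width": ["79", "360", ">165", np.nan],
})

# Record the flags first, while the markers are still in the strings.
cme["isHalo"] = cme["cme_angle"].eq("Halo")
cme["width_lower_bound"] = cme["cme_width"].str.startswith(">", na=False)

# Then blank out "Halo", strip ">", and convert both columns to numbers.
cme["cme_angle"] = pd.to_numeric(cme["cme_angle"].mask(cme["isHalo"]))
cme["cme_width"] = pd.to_numeric(cme["cme_width"].str.lstrip(">"))
```

Computing the flags before cleaning matters: once `>` and `Halo` are removed, that information cannot be recovered from the numeric columns.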

In [31]:
# merging date and time
for idx, row in nasa_df.iterrows():
    for c in ["start_time", "end_time"]:
        # "start_time" pairs with "start_date", "end_time" with "end_date"
        dt = [int(i) for i in row[c.replace("_time", "_date")].split("/")]
        if c == "start_time":
            # remember the year: the end date is listed without one
            year = dt[0]
        else:
            # assume the event ends in the same year it started
            dt = [year] + dt
        tm = row[c]
        # clean dirty timings: hour 24 becomes hour 00 of the next day
        add = "24:" in tm
        if add:
            tm = tm.replace("24:", "00:")
        dt_f = datetime.datetime(*dt)
        if add:
            dt_f += datetime.timedelta(days=1)
        tm_f = datetime.time(*[int(i) for i in tm.split(":")])
        nasa_df.at[idx, c] = datetime.datetime.combine(dt_f, tm_f)
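
The loop above reuses the start year for the end date, which would be off by one year for an event that begins in late December and ends in January. A possible way to handle that, sketched with hypothetical helpers `combine` and `event_span` (neither is in the original notebook), is to try the start year first and fall back to the following year when the end would precede the start:

```python
import datetime

def combine(year, date_str, time_str):
    """Build a datetime from 'MM/DD' (or 'YYYY/MM/DD') and 'HH:MM'.
    An hour of 24 is treated as hour 00 of the following day."""
    parts = [int(p) for p in date_str.split("/")]
    if len(parts) == 2:
        parts = [year] + parts  # the year is missing, so use the supplied one
    hour, minute = (int(p) for p in time_str.split(":"))
    extra = datetime.timedelta(0)
    if hour == 24:
        hour, extra = 0, datetime.timedelta(days=1)
    return datetime.datetime(*parts, hour, minute) + extra

def event_span(start_date, start_time, end_date, end_time):
    """Return (start, end), picking the first end year that is not before start."""
    start = combine(None, start_date, start_time)
    for year in (start.year, start.year + 1):
        try:
            end = combine(year, end_date, end_time)
        except ValueError:
            continue  # e.g. 02/29 in a non-leap year
        if end >= start:
            return start, end
    raise ValueError("end date cannot be placed after the start date")

span = event_span("1997/05/12", "05:15", "05/14", "16:00")
rollover = event_span("2000/12/31", "23:00", "01/01", "02:00")  # made-up event
```

The rollover event is invented purely to exercise the year change; whether the real table contains such a row is not checked here.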

In [32]:
# edit angle: "Halo" is not a numeric angle, so store it as missing
nasa_df["cme_angle"] = nasa_df["cme_angle"].replace("Halo", float(nan))
# the dates are now part of the datetime columns, so drop them
nasa_df.drop(["start_date", "end_date"], axis=1, inplace=True)
# rename datetime columns
nasa_df.rename(columns={'start_time':'start_datetime', 'end_time':'end_datetime'}, inplace=True)
# edit width: strip the ">" that marks a lower bound
nasa_df["cme_width"] = nasa_df["cme_width"].str.replace(">", "", regex=False)
print(nasa_df)

          start_datetime         end_datetime start_frequency end_frequency  \
0    1997-04-01 14:00:00  1997-04-01 14:15:00            8000          4000   
1    1997-04-07 14:30:00  1997-04-07 17:30:00           11000          1000   
2    1997-05-12 05:15:00  1997-05-14 16:00:00           12000            80   
3    1997-05-21 20:20:00  1997-05-21 22:00:00            5000           500   
4    1997-09-23 21:53:00  1997-09-23 22:16:00            6000          2000   
..                   ...                  ...             ...           ...   
513  2017-09-04 20:27:00  2017-09-05 04:54:00           14000           210   
514  2017-09-06 12:05:00  2017-09-07 08:00:00           16000            70   
515  2017-09-10 16:02:00  2017-09-11 06:50:00           16000           150   
516  2017-09-12 07:38:00  2017-09-12 07:43:00           16000         13000   
517  2017-09-17 11:45:00  2017-09-17 12:35:00           16000           900   

    flare_location flare_region flare_classificatio

## Part 2: Analysis

In [33]:
nasa_df.sort_values(by=['flare_classification'], ascending=False, inplace=True)
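
Because `flare_classification` holds strings, the sort above compares labels character by character, so `X9.3` lands ahead of `X28`. One way to rank by strength instead, sketched with a hypothetical `flare_key` helper, is to split each label into its class letter (A, B, C, M, X in increasing order) and its numeric magnitude. The sample labels are illustrative:

```python
# Class letters from weakest to strongest.
CLASS_ORDER = {"A": 0, "B": 1, "C": 2, "M": 3, "X": 4}

def flare_key(label):
    """Return (class rank, magnitude) for labels such as 'X9.3'.
    Missing or malformed labels sort below every valid one."""
    if not isinstance(label, str) or not label or label[0] not in CLASS_ORDER:
        return (-1, 0.0)
    try:
        magnitude = float(label[1:])
    except ValueError:
        return (-1, 0.0)
    return (CLASS_ORDER[label[0]], magnitude)

labels = ["X9.3", "M1.3", "X28", float("nan"), "X17.2"]
ranked = sorted(labels, key=flare_key, reverse=True)
lexical = sorted(["X9.3", "X28"], reverse=True)  # plain string comparison
```

Here `ranked` starts with `X28`, while `lexical` puts `X9.3` first, which is the ordering problem the key function is meant to avoid.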

In [43]:
# place the two tables side by side; rows are aligned on their index labels
cdf = pd.concat([nasa_df, competitor_df], axis=1)

In [44]:
nasa_df.to_csv("nasaCleaned.csv")

In [45]:
cdf.to_csv("combined.csv")

In [51]:
base = datetime.datetime.today()
numdays = 7
date_list0 = [base - datetime.timedelta(days=x) for x in range(numdays)]
date_list1 = [base - datetime.timedelta(days=x+1) for x in range(numdays)]
num_list0 = [11,22,33,44,55,66,77]
num_list1 = [22,33,44,55,66,77,88]
print(date_list0)

[datetime.datetime(2022, 9, 29, 17, 21, 12, 805801), datetime.datetime(2022, 9, 28, 17, 21, 12, 805801), datetime.datetime(2022, 9, 27, 17, 21, 12, 805801), datetime.datetime(2022, 9, 26, 17, 21, 12, 805801), datetime.datetime(2022, 9, 25, 17, 21, 12, 805801), datetime.datetime(2022, 9, 24, 17, 21, 12, 805801), datetime.datetime(2022, 9, 23, 17, 21, 12, 805801)]
