<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---

<CENTER>
<H1> <font color="red" size="+3">
    Web Scraping with Python </font> 
    <br>
    Helio Applications
</H1>
</CENTER>

## <font color='red'>Reference Documents</font>

- [Web Scraping: What It Is and How to Use It](https://scrape-it.cloud/blog/web-scraping-what-it-is-and-how-to-use-it)
- [What is web scraping](https://www.scrapehero.com/a-beginners-guide-to-web-scraping-part-1-the-basics/)
- [Python Requests Tutorial](http://zetcode.com/python/requests/)
- [Python’s Requests Library (Guide)](https://realpython.com/python-requests/)
- [Building a Web Scraper from start to finish](https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184)

## <font color='red'>Objectives</font>

In this course, we want to describe web scraping and show how it can be accomplished with Python. We present the basic steps of web scraping and run examples on accessing HTTP servers, grabbing the content of web pages (in JSON format), parsing the content to extract useful information and performing analyses.

The following topics will be covered:

+ What is web scraping?
+ What is JSON?
+ Accessing Web Pages with `requests`
+ Web Scraping with `json`

We expect that at the end of this presentation, participants will be able to write Python scripts that automatically perform web scraping to extract specific data from webpages.

## <font color='red'>Required Packages</font>
We will need the Python packages:

- `requests`: for accessing servers and getting the contents of web pages.
- `json`: for manipulating JSON documents.
- `BeautifulSoup`: for pulling information out of HTML and XML files. 
- `Pandas`: for data analysis and manipulation.
- `Matplotlib`: for creating visualizations.
- `Seaborn`: for drawing attractive and informative statistical graphics.
- `Plotly`: for making interactive graphs.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
import pprint
import re

In [None]:
import requests as reqs

In [None]:
import json

In [None]:
import matplotlib.pyplot as plt

In [None]:
import numpy as np
import pandas as pd

In [None]:
import seaborn as sns

In [None]:
import plotly.express as px

In [None]:
from bs4 import BeautifulSoup as bso

In [None]:
from IPython.display import HTML

In [None]:
#print(f"Requests version:  {reqs.__version__}")
#print(f"JSON version:      {json.__version__}")

# <font color='red'>Accessing web servers with `requests` </font>

* Requests is a third-party Python package (it is not part of the standard library and is installed with `pip install requests`).
* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library. 
* It provides methods for accessing Web resources via HTTP. 
* The HTTP request returns a Response Object with all the response data (content, encoding, status, etc.).

**Reading a Web Page**
- We use the function `get()` to grab the content of a web page into an object.
- We extract from the object the HTML content of the page.

### Content-Type

- It is part of the HTTP header.
   - A string used to indicate the media type of the resource you want to access.  
   - It tells the browser the type of content it has to load on the machine. 
- Here are some values of `content-type`:

```html
   text/html
   text/csv
   application/json
   application/javascript
   audio/ogg
   image/png
```

- While scraping a web page, it is used to determine which tool is needed to parse the content of a web page.
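
As a minimal sketch of that last point, the function below maps a `content-type` value to the kind of parser one might reach for. The function name `choose_parser` and the returned labels are hypothetical, chosen only for illustration; real headers often carry extra parameters such as `charset`, which the sketch strips before comparing.

```python
def choose_parser(content_type: str) -> str:
    """Map a Content-Type header value to the kind of parser to use."""
    # Drop optional parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return "json"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    if media_type == "text/csv":
        return "csv"
    return "unsupported"


print(choose_parser("application/json; charset=utf-8"))
print(choose_parser("text/html"))
```

In this course, a JSON result would be handed to the `json` module and an HTML result to `BeautifulSoup`.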

#### Function to access a web page
We want to pass the `payload` and `timeout` as arguments of the function.

In [None]:
def access_website(url: str, payload: dict=None, timeout: int=10):
    """
    Attempt to access a server. If the attempt is successful,
    return the response object; otherwise print an error message
    and return None.

    Parameters
    ----------
    url : str
       HTTP address of the web page we want to access
    payload : dict
       Parameters needed to construct the target url.
    timeout: int
       Maximum time (in seconds) to wait for the server to respond.

    Returns
    -------
    resp : requests.Response or None
       Object which has information on the web page of interest,
       or None if the request failed.
    """
    # Start from None so that `resp` is always defined when we return.
    resp = None
    try:
        # Passing params=None is the same as sending no query parameters.
        resp = reqs.get(url, params=payload, timeout=timeout)
        resp.raise_for_status()
    except reqs.exceptions.HTTPError as errh:
        print(f"Http Error: {errh}")
        resp = None
    except reqs.exceptions.ConnectionError as errc:
        print(f"Error Connecting: {errc}")
    except reqs.exceptions.Timeout as errt:
        print(f"Timeout Error: {errt}")
    except reqs.exceptions.RequestException as err:
        print(f"General Error: {err}")
    else:
        print(f"Successfully accessed the site: \n\t {resp.url}")
        print(f"Content type: \n\t {resp.headers.get('content-type')}")
    return resp

---

## <font color="red">Accessing the DONKI webservice API</font>

The Space Weather Database Of Notifications, Knowledge, Information (DONKI), developed at the Community Coordinated Modeling Center (CCMC), is a comprehensive on-line tool for space weather forecasters, scientists, and the general space science community.

DONKI provides:
- A chronicle of the daily interpretations of space weather observations, analyses, model outputs, and notifications provided by the Moon to Mars Space Weather Analysis Office as a courtesy to the community.
- Comprehensive knowledge-base search functionality to support anomaly resolution and space science research.
- Intelligent linkages, relationships, and cause-and-effect connections between space weather activities.
- Comprehensive webservice API access to the information stored in DONKI.


__Our goal__:
- We want to be able to obtain from the webpage [https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME](https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME) the Coronal Mass Ejection (CME) data
- We access the webpage (using a set of parameters) and retrieve the content of the page as a JSON object.


### <font color="blue">Access CME Web Database Server</font>

The base URL is: `https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME`

To access the data, we need to provide two parameters:

- `startDate` (YYYY-MM-DD): defaults to 30 days prior to the current UTC date
- `endDate` (YYYY-MM-DD): defaults to the current UTC date

In [None]:
url_cme = "https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME"

In [None]:
startDate="2017-01-01"
endDate="2017-03-31"

payload = {
    'startDate': startDate,
    'endDate': endDate,
}

page_content = access_website(url_cme, payload, timeout=150)
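
To make the role of the payload concrete, here is a sketch of how the two parameters end up in the request URL as a query string. It uses only the standard library (`urllib.parse.urlencode`) instead of `requests`, so it runs without network access. The helper name `build_cme_url` and its `today` argument are hypothetical, and the default dates simply imitate the documented defaults; the server applies its own.

```python
from datetime import date, datetime, timedelta, timezone
from urllib.parse import urlencode

BASE_URL = "https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME"


def build_cme_url(start_date=None, end_date=None, today=None):
    """Return the full request URL for a date range (strings in YYYY-MM-DD)."""
    if today is None:
        today = datetime.now(timezone.utc).date()
    if end_date is None:
        end_date = today.isoformat()
    if start_date is None:
        # Imitate the documented default: 30 days before the current UTC date.
        start_date = (today - timedelta(days=30)).isoformat()
    query = urlencode({"startDate": start_date, "endDate": end_date})
    return f"{BASE_URL}?{query}"


print(build_cme_url("2017-01-01", "2017-03-31"))
print(build_cme_url(today=date(2024, 3, 31)))
```

When a dictionary is passed through `params`, `requests` builds an equivalent query string, and the final address is available afterwards as `resp.url`.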

### <font color="blue">Process the data with JSON</font>

- We have a JSON object that we need to convert into a Python object.
- We need to use the `loads()` function (deserialization).

In [None]:
json_page = json.loads(page_content.text)
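
The sketch below shows the same deserialization step on a small, made-up JSON string, so that the mapping between JSON and Python types is visible without contacting the server. The field names imitate the CME response, but the values are invented for illustration.

```python
import json

# Made-up sample shaped like two entries of a CME response (not real data).
sample_text = """
[
  {"startTime": "2017-01-01T00:39Z", "cmeAnalyses": null, "instruments": []},
  {"startTime": "2017-01-02T12:00Z",
   "cmeAnalyses": [{"speed": 400.0, "latitude": -5.0}],
   "instruments": [{"displayName": "SOHO: LASCO/C2"}]}
]
"""

records = json.loads(sample_text)              # str -> Python objects
print(type(records).__name__)                  # JSON array becomes a list
print(records[0]["cmeAnalyses"])               # JSON null becomes None
print(records[1]["cmeAnalyses"][0]["speed"])   # JSON number becomes a float

round_trip = json.dumps(records)               # Python objects -> str
```

`json.dumps()` is the inverse operation (serialization), which is useful for saving the retrieved data to a file.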

Whenever we have a Python object, it is good to determine its type before proceeding.

In [None]:
type(json_page)

What is the length of the list?

In [None]:
len(json_page)

Let us look at the first entry of the list:

In [None]:
pprint.pprint(json_page[0])

It seems that there was no activity on `2017-01-01T00:39Z`: 

```
  'cmeAnalyses': None
```

In [None]:
one_activity = json_page[1]

In [None]:
pprint.pprint(one_activity)

### <font color="blue">Basic data extraction from one CME activity</font>

We need to first get the keys of the dictionary:

In [None]:
one_activity.keys()

We can focus on:

- `'cmeAnalyses'`
- `'startTime'`
- `'instruments'`

__Get instruments__

In [None]:
one_activity['instruments']

In [None]:
for i, item in enumerate(one_activity['instruments'], start=1):
    print(f"Instrument {i}- {item['displayName']}")

__Get start date/time__

In [None]:
one_activity['startTime']

__Get CME activity data__

In [None]:
pprint.pprint(one_activity['cmeAnalyses'])

We want to be able to extract the following parameters:
- `speed`
- `longitude`
- `latitude`
- `halfAngle`

In [None]:
type(one_activity['cmeAnalyses'])

In [None]:
len(one_activity['cmeAnalyses'])

In [None]:
pprint.pprint(one_activity['cmeAnalyses'][0])

In [None]:
latitude = one_activity['cmeAnalyses'][0]['latitude']
longitude = one_activity['cmeAnalyses'][0]['longitude']
halfAngle = one_activity['cmeAnalyses'][0]['halfAngle']
speed = one_activity['cmeAnalyses'][0]['speed']

In [None]:
print(f"  latitude: {latitude}")
print(f" longitude: {longitude}")
print(f" halfAngle: {halfAngle}")
print(f"     speed: {speed}")

### <font color="blue">Data Analysis</font>

- We want to inspect the obtained Python list and identify the dates/times when there were CME activities.
- For each date/time when there was an activity, we will extract from the list:
    - The `date/time`, the `speed`, the latitude, the longitude, and the half angle
    - The list of instruments 

In [None]:
def get_CME_parameters(cme_activity: dict):
    """
    From a dictionary containing CME activity data on a specific date/time,
    extract the following parameters:

    - starting date/time
    - latitude
    - longitude
    - halfAngle
    - speed
    """

    latitude = cme_activity['cmeAnalyses'][0]['latitude']
    longitude = cme_activity['cmeAnalyses'][0]['longitude']
    half_angle = cme_activity['cmeAnalyses'][0]['halfAngle']
    speed = cme_activity['cmeAnalyses'][0]['speed']
    start_time = cme_activity['startTime']
    list_instruments = list()
    for item in cme_activity['instruments']:
        list_instruments.append(item['displayName'])
    return start_time, speed, latitude, longitude, half_angle, tuple(sorted(list_instruments))

In [None]:
get_CME_parameters(one_activity)
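
Entries whose `cmeAnalyses` value is `None`, or that lack one of the keys, would make direct indexing fail. A more defensive variant, sketched here on made-up records, uses `dict.get()` and returns `None` when there is nothing to extract. The function name `extract_first_analysis` is hypothetical and is not part of the course code.

```python
def extract_first_analysis(activity: dict):
    """Return (start_time, speed, latitude, longitude, half_angle) or None."""
    analyses = activity.get("cmeAnalyses") or []   # treats None and [] alike
    if not analyses:
        return None
    first = analyses[0]
    return (
        activity.get("startTime"),
        first.get("speed"),
        first.get("latitude"),
        first.get("longitude"),
        first.get("halfAngle"),    # None when the key is absent
    )


# Made-up records for illustration only.
sample = [
    {"startTime": "2017-01-01T00:39Z", "cmeAnalyses": None},
    {"startTime": "2017-01-02T12:00Z",
     "cmeAnalyses": [{"speed": 400.0, "latitude": -5.0, "longitude": 20.0}]},
]
rows = [row for row in map(extract_first_analysis, sample) if row is not None]
print(rows)
```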

In [None]:
def create_df(cme_activities: list):
    """
    Using a list of CME activities, create a Pandas DataFrame
    with columns:
    
    - start_time
    - speed
    - latitude
    - longitude
    - half_angle
    - instruments
    """
    columns = ["start_time", "speed", "latitude", "longitude", "half_angle", "instruments"]
    df = pd.DataFrame(columns=columns)

    # Loop over the CME events
    for cme_activity in cme_activities:
        # Only process entries for which a CME analysis was recorded.
        if cme_activity['cmeAnalyses']:
            df.loc[len(df)] = get_CME_parameters(cme_activity)
    return df

In [None]:
df = create_df(json_page)

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

We need to convert the `start_time` into a datetime object:

In [None]:
df['start_time'] = pd.to_datetime(df['start_time'])
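
For a single value, the same conversion can be sketched with the standard library. The timestamps seen earlier have the form `YYYY-MM-DDTHH:MMZ`, where the trailing `Z` denotes UTC; the helper name `parse_start_time` is hypothetical.

```python
from datetime import datetime, timezone


def parse_start_time(text: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MMZ' string into a timezone-aware UTC datetime."""
    naive = datetime.strptime(text, "%Y-%m-%dT%H:%MZ")
    return naive.replace(tzinfo=timezone.utc)


stamp = parse_start_time("2017-01-01T00:39Z")
print(stamp.isoformat())
print(stamp.year, stamp.month, stamp.day)
```

Once the values are datetime objects, components such as the year or month can be read directly, which is what the later month-by-month statistics rely on.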

In [None]:
df

Set the `start_time` column as index of the Pandas DataFrame:

In [None]:
df.set_index('start_time', inplace=True)
df

#### Write a function that accesses the DONKI API and collects CME activities

In [None]:
def collect_cme_activities(start_date: str, end_date: str):
    """
    Use web scraping to gather from DONKI all the CME activities
    between the beginning date (start_date) and the end date (end_date).

    Parameters
    ----------
    start_date : str
       Beginning date in the format YYYY-MM-DD
    end_date : str
       End date in the format YYYY-MM-DD

    Returns
    -------
    df : Pandas DataFrame
       Pandas DataFrame containing parameters of CME activities.
    """
    url_cme = "https://kauai.ccmc.gsfc.nasa.gov/DONKI/WS/get/CME"
    payload = {
        'startDate': start_date,
        'endDate': end_date
    }

    page_content = access_website(url_cme, payload, timeout=150)
    json_page = json.loads(page_content.text)
    df = create_df(json_page)

    df['start_time'] = pd.to_datetime(df['start_time'])
    df.set_index('start_time', inplace=True)
    return df

In [None]:
df = collect_cme_activities("2017-01-01", "2019-12-31")

In [None]:
df

__Get basic information on columns using the `info()` method__

In [None]:
df.info()

__Observations__: 

- The values of each column have the same data type.
- There are 330 data points (rows).
- The columns `latitude` and `longitude` have 329 and 325 values respectively.
   - The two columns have missing values.
   - If the two columns are important for your analysis, you need to decide what to do with missing values.

__To get the number of missing values in each column:__

In [None]:
df.isnull().sum()

__Basic descriptive statistics__

In [None]:
df.describe()

### <font color="green">Focus on the instruments</font>

__Get the unique instruments__

In [None]:
set_instruments = set()

for item in df['instruments'].values:
    for inst in item:
        set_instruments.add(inst)

print(f"Set of instruments: \n\t {sorted(list(set_instruments))}")

__Combined sets of instruments__

In [None]:
df_instr = df.groupby('instruments')

In [None]:
df_instr

In [None]:
df_instr.groups.keys()

In [None]:
df_instr.count()["speed"]

In [None]:
plt.pie(df_instr.count()["speed"], 
        labels=list(df_instr.groups.keys()), 
         autopct='%.0f%%') 
plt.title("Pie Chart of Unique Combined Sets of Instruments");

In [None]:
fig = px.pie(df_instr.count(), 
             values="speed",
             names=list(df_instr.groups.keys()),
             title="Pie Chart of Unique Combined Sets of Instruments",
            width=700)
fig.show()

__Count the number of times each instrument is used__

We will create a dictionary with key as the instrument name and value the number of times used.

In [None]:
dict_instruments = dict()

for item in df['instruments'].values:
    for inst in item:
        if inst in dict_instruments:
            dict_instruments[inst] += 1
        else:
            dict_instruments[inst] = 1
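
The same tally can be written more compactly with `collections.Counter`. The sketch below runs on made-up instrument tuples that stand in for the values of the `instruments` column.

```python
from collections import Counter
from itertools import chain

# Made-up instrument tuples standing in for the `instruments` column.
instrument_sets = [
    ("SOHO: LASCO/C2", "SOHO: LASCO/C3"),
    ("SOHO: LASCO/C2", "STEREO A: SECCHI/COR2"),
    ("SOHO: LASCO/C2",),
]

# Flatten the tuples into one stream of names and count each name.
counts = Counter(chain.from_iterable(instrument_sets))
print(counts.most_common())
```

A `Counter` is a dictionary subclass, so it can be passed to the plotting calls in the same way as `dict_instruments`.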

In [None]:
dict_instruments

In [None]:
plt.bar(dict_instruments.keys(), dict_instruments.values());

In [None]:
dict_instruments.values()

In [None]:
plt.pie(list(dict_instruments.values()), 
        labels=list(dict_instruments.keys()), 
         autopct='%.0f%%') 
plt.title("How many times instruments are used");

### <font color="green"> Plot time series values of the speed</font>

Using Matplotlib:

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
ax.scatter(df.index, df["speed"], s=7)
ax.set_xlabel("CME Start Date")
ax.set_ylabel("CME Speed (km/s)")
ax.set_title(f"DONKI CME Database \n Date vs CME Speed")
ax.grid()

Using Seaborn:

In [None]:
fig = plt.figure(figsize=(12, 7))
ax = sns.scatterplot(x=df.index, y=df['speed'])
ax.set_xlabel("CME Start Date")
ax.set_ylabel("CME Speed (km/s)")
ax.set_title(f"DONKI CME Database \n Date vs CME Speed")
ax.grid()

Using Plotly (interactive plot):

In [None]:
fig = px.scatter(x=df.index, y=df['speed'],
                width=800, height=600)
fig.show()

__Create histograms__

In [None]:
df["speed"].hist(bins=5)

In [None]:
fig = px.histogram(df, x="speed", nbins=5)
fig.show()

__Create a new column to categorize speed into ranges__

| Type | Speed range |
| --- | --- |
| `S-type` | $<500$ km/s |
| `C-type` | $500-999$ km/s |
| `O-type` | $1000-1999$ km/s |
| `R-type` | $2000-2999$ km/s |
| `ER-type` | $>3000$ km/s |

In [None]:
speed_bins = [0, 500, 1000, 2000, 3000, float('inf')]
speed_labels = [
    'S-type (<500 km/s)', 
    'C-type (500-999 km/s)', 
    'O-type (1000-1999 km/s)', 
    'R-type (2000-2999 km/s)', 
    'ER-type (>3000 km/s)'
]
assert(len(speed_bins)-1 == len(speed_labels))
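
To see how boundary values are classified, here is a pure-Python sketch of the same binning using `bisect`. It assumes that a boundary value such as 500 or 3000 km/s belongs to the faster class, which is one reading of the table; the names `EDGES`, `TYPES` and `classify_speed` are hypothetical.

```python
from bisect import bisect_right

EDGES = [500, 1000, 2000, 3000]     # class boundaries in km/s
TYPES = ["S-type", "C-type", "O-type", "R-type", "ER-type"]


def classify_speed(speed_km_s: float) -> str:
    """Return the CME type, treating each range as [lower, upper)."""
    # bisect_right counts how many boundaries are <= the speed.
    return TYPES[bisect_right(EDGES, speed_km_s)]


for speed in (499.0, 500.0, 999.0, 1000.0, 3000.0):
    print(speed, classify_speed(speed))
```

Note that `pd.cut` uses right-closed bins by default (`right=True`), so a speed of exactly 500 km/s falls into the lower bin unless `right=False` is passed.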

In [None]:
df['speed_range'] = pd.cut(
    df['speed'], 
    bins=speed_bins,
    labels=speed_labels,
    right=False   # left-closed bins, so 500 km/s is C-type as the labels state
)
df

In [None]:
fig = px.histogram(df, x="speed_range", color="speed_range")
fig.show()

In [None]:
fig = px.bar(df, x='speed_range', y='speed', color="speed_range", text='speed', height=600)
fig.show()

In [None]:
fig = px.bar(df, x=df.index.year, y='speed', color="speed_range", height=600)
fig.show()

#### Plot the time series values of speed by categories

In [None]:
fig = plt.figure(figsize=(12, 7))
ax = sns.scatterplot(data=df, x=df.index, y='speed', hue='speed_range')
ax.set_xlabel("CME Start Date")
ax.set_ylabel("CME Speed (km/s)")
ax.set_title(f"DONKI CME Database \n Date vs CME Speed")
ax.grid()

#### Longitude vs Speed plot (polar plot)

In [None]:
g = sns.FacetGrid(df, subplot_kws=dict(projection='polar'), 
                  height=5, sharex=False, sharey=False, despine=False)
g.map_dataframe(sns.scatterplot, 
                x='longitude', 
                y='speed', 
                hue='speed_range');

We can generate the same polar plot with `Plotly` (note the interactivity):

In [None]:
fig = px.scatter_polar(df, 
                       r='speed', 
                       theta='longitude', 
                       color='speed_range', 
                       width=700, height=700)

#fig.update_layout(showlegend=False)
fig.show()

### <font color="green">Gather statistics</font>

__CME for a specific month__

In [None]:
df[df.index.month==2]

In [None]:
month_count_df = pd.DataFrame(columns=["Month", "Count"])
for i in range(1,13):
    month_count_df.loc[len(month_count_df)] = [i, len(df[df.index.month==i])]
month_count_df
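
The month-by-month count can also be sketched without Pandas, which makes the logic explicit: extract the month of each start time, tally the months, and report all twelve so that months without events appear with a count of 0. The start times below are made up for illustration. As in the cell above, events from different years are pooled by month.

```python
from collections import Counter
from datetime import datetime

# Made-up start times for illustration only.
start_times = ["2017-01-03T03:24Z", "2017-02-19T05:48Z", "2018-02-12T01:25Z"]

months = [datetime.strptime(t, "%Y-%m-%dT%H:%MZ").month for t in start_times]
month_counts = Counter(months)

# Report all 12 months so that months without events show up as 0.
table = [(month, month_counts.get(month, 0)) for month in range(1, 13)]
print(table[:3])
```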

In [None]:
month_count_df["Month"] = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", 
                           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

In [None]:
# Create the bar plot
sns.barplot(data=month_count_df, x='Month', y='Count', hue="Count", legend=False)

# Add title and labels
#plt.title('Bar Chart Example')

In [None]:
fig = px.bar(month_count_df, x="Month", y='Count', height=500)
fig.show()

__Data on a specific day__

In [None]:
df.loc['2019-05-13']

__Data over a specific range of date__

In [None]:
df.loc['2018-01-01 09:00:00':'2018-04-14 11:00:00']

__Data for a specific year/month__

In [None]:
df.loc['2018-01']

__Data with a specific speed__

In [None]:
df[df.speed==350.0]

__Speed monthly and yearly means__

In [None]:
speed_monthly_means = df.speed.resample('M').mean()

In [None]:
speed_monthly_means.plot()

In [None]:
df.speed.resample('Y').mean()

__Group by speed range__

In [None]:
df_range = df.groupby('speed_range')

In [None]:
list_groups = list(df_range.groups.keys())
list_groups

In [None]:
df_range.count()["speed"]

In [None]:
df_S = df_range.groups['S-type (<500 km/s)']

In [None]:
df_S

In [None]:
len(df_S[df_S.slice_indexer("2017-01-01", "2017-12-31")])

In [None]:
len(df_S[df_S.slice_indexer("2018-01-01", "2018-12-31")])

In [None]:
len(df_S[df_S.slice_indexer("2019-01-01", "2019-12-31")])

In [None]:
fig = px.pie(df_range.count(), values="speed", names=list_groups)
fig.show()

---
---

# <font color="red">Search and plot GIF images</font>

Here we focus on a single day of CME activity and use web scraping to extract the GIF images generated for that activity.

In [None]:
startDate="2017-01-03"
endDate="2017-01-03"

payload = {
    'startDate': startDate,
    'endDate': endDate,
}

page_content = access_website(url_cme, payload, timeout=15)

#### Task 1: Get link associated with the `enlilList` key

In [None]:
page = json.loads(page_content.text)[0]

In [None]:
page['cmeAnalyses']

In [None]:
pprint.pprint(page['cmeAnalyses'][0])

__Extract the link of interest__

In [None]:
url_link = page['cmeAnalyses'][0]['enlilList'][0]['link']
url_link

#### Task 2: Get the GIF images contained in the webpage

__Access the webpage and get its content__

In [None]:
source = access_website(url_link)

__Use BeautifulSoup to obtain the HTML document__

In [None]:
mysoup = bso(source.text, 'html.parser')

In [None]:
print(mysoup.prettify())

__Parse the content of the HTML document to extract all the web links that have gif images__

In [None]:
list_images = [a['href'] for a in mysoup.select('a[href]') if "iswa" in a['href']]
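
For comparison, the same link filtering can be sketched with the standard library's `html.parser` instead of BeautifulSoup. The class name `LinkCollector` is hypothetical, and the HTML fragment and its URLs are made-up placeholders, not real endpoints.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose URL contains a keyword."""

    def __init__(self, keyword: str):
        super().__init__()
        self.keyword = keyword
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")   # None when the tag has no href
        if href and self.keyword in href:
            self.links.append(href)


# Made-up HTML fragment; the URLs are placeholders, not real endpoints.
sample_html = """
<a href="https://example.org/iswa/frame1.gif">Frame 1</a>
<a href="https://example.org/about.html">About</a>
<a>No href here</a>
"""

collector = LinkCollector("iswa")
collector.feed(sample_html)
print(collector.links)
```

BeautifulSoup remains the more convenient choice for real pages, since CSS selectors such as `a[href]` replace the hand-written tag handling.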

In [None]:
len(list_images)

In [None]:
list_images

#### Task 3: Display some of the remote images

In [None]:
HTML(f'<img src={list_images[0]}>')

In [None]:
HTML(f'<img src={list_images[1]}>')

In [None]:
HTML(f'<img src={list_images[2]}>')

In [None]:
HTML(f'<img src={list_images[3]}>')

In [None]:
HTML(f'<img src={list_images[7]}>')

---

# <font color="red"> Breakout</font>

This exercise is inspired by:


[Project 1: Data scraping and cleaning](https://www.cs.umd.edu/class/spring2017/cmsc320/hw/Project1.html)

A possible solution can be seen at:

[NASA Data Scraping, Cleaning, & Data Analysis](https://www.kaggle.com/code/fareselsalamouny/nasa-data-scraping-cleaning-data-analysis)


## <font color="blue"> Part 1:</font> Scrape data for the top 50 solar flares

Access the website:

[https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares.html](https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares.html)

to create a Pandas DataFrame.


In [None]:
import http.client as http_client
http_client.HTTPConnection.debuglevel = 0 #1

In [None]:
spaceweather_url = "https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares"

In [None]:
spaceweather_resp = access_website(spaceweather_url)

In [None]:
spaceweather_soup = bso(spaceweather_resp.content, 'html5lib')

In [None]:
#spaceweather_soup

In [None]:
print(spaceweather_soup.prettify())

__Extract the HTML table from the web document__

In [None]:
html_table = spaceweather_soup.find('table')

In [None]:
type(html_table)

__Create a DataFrame__

In [None]:
spaceweather_columns = [
    'rank', 
    'x_class', 
    'date', 
    'region', 
    'start_time', 
    'max_time', 
    'end_time', 
    'movie'
]

In [None]:
spaceweather_df = pd.DataFrame(columns=spaceweather_columns, index = range(0,50))

# Loop over rows of HTML table to extract the text and populate the DataFrame
row_marker = 0
for row in html_table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    row_marker += 1
    for column in columns:
        if (row_marker > 1):
            spaceweather_df.iat[row_marker-2,column_marker] = column.get_text()
            column_marker += 1

In [None]:
spaceweather_df

__Drop the last column (`movie`)__

In [None]:
spaceweather_df = spaceweather_df.drop(spaceweather_columns[-1], axis=1)

In [None]:
spaceweather_df.info()

__Combine the date column with the time columns to convert to datetime format__

In [None]:
spaceweather_df['start_time'] = pd.to_datetime(spaceweather_df['date'] + ' ' + spaceweather_df['start_time'])

In [None]:
spaceweather_df['max_time'] = pd.to_datetime(spaceweather_df['date'] + ' ' + spaceweather_df['max_time'])

In [None]:
spaceweather_df['end_time'] = pd.to_datetime(spaceweather_df['date'] + ' ' + spaceweather_df['end_time'])
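
The idea behind these three cells, joining a date string and a time string before parsing, can be sketched for a single row with the standard library. The row below is made up, and the `YYYY/MM/DD` and `HH:MM` formats are assumptions about how the scraped text looks; the helper name `combine` is hypothetical.

```python
from datetime import datetime


def combine(date_text: str, time_text: str) -> datetime:
    """Join a 'YYYY/MM/DD' date and an 'HH:MM' time into one datetime."""
    return datetime.strptime(f"{date_text} {time_text}", "%Y/%m/%d %H:%M")


# Made-up row for illustration; the date and time formats are assumptions.
row = {"date": "2003/11/04", "start_time": "19:29", "end_time": "20:06"}
start = combine(row["date"], row["start_time"])
end = combine(row["date"], row["end_time"])
print(start, end - start)
```

Because the same date is attached to all three times, an event that ends after midnight would need its end date advanced by one day.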

__Drop the `date` column__

In [None]:
spaceweather_df = spaceweather_df.drop('date', axis=1)

__Change the order of the columns__

In [None]:
spaceweather_df = spaceweather_df[['rank', 'x_class', 'start_time', 'max_time', 'end_time', 'region']]

__Replace all instances of `'-'` with `'<NA>'`__

In [None]:
spaceweather_df = spaceweather_df.replace('-', '<NA>')

In [None]:
spaceweather_df

## <font color="blue"> Part 2:</font> Scrape the NASA data

Access the webpage:

[https://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html](https://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html)

to scrape the STEREO and Wind WAVES type II bursts and the associated CMEs. 

In [None]:
nasa_url = "https://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html"

In [None]:
nasa_resp = access_website(nasa_url)

In [None]:
nasa_bs = bso(nasa_resp.content, 'html.parser')

In [None]:
#print(nasa_bs.prettify())

__Extract the HTML table__

In [None]:
nasa_table = nasa_bs.find('pre').text

In [None]:
#print(nasa_table)

__Convert the table to a list__

- Do not take the first 11 rows of the table.
- Do not take any row that has `===`.

In [None]:
nasa_table_list = [x.split() for x in nasa_table.split('\n')[11:] if "===" not in x and x]
#nasa_table_list
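
The filtering logic can be checked on a small, made-up block of preformatted text before applying it to the real page. In this sketch the number of header lines is 2 instead of 11, and the sample rows are invented.

```python
# Made-up text mimicking the layout of a preformatted table (not real data).
sample_pre = """Header line 1
Header line 2
==========================
1997/04/01 14:00 04/01 14:15 8000 4000
1997/04/07 14:30 04/07 17:30 11000 1000

==========================
"""

N_HEADER = 2  # number of leading lines to skip in this sample

# Keep non-empty lines that are not separators, split on whitespace.
rows = [
    line.split()
    for line in sample_pre.split("\n")[N_HEADER:]
    if line and "===" not in line
]
print(rows)
```

Splitting on whitespace works only as long as no field contains a space and no field is left blank; rows that break this assumption end up with a different number of tokens.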

__Create the DataFrame__

- Only consider the first 14 columns.

In [None]:
nasa_df = pd.DataFrame(nasa_table_list).loc[:,0:13]

In [None]:
nasa_df

In [None]:
nasa_df.info()

__Provide the labels for the columns__

In [None]:
columns = [
    "start_date", #Column 1: Starting date of the type II burst (yyyy/mm/dd format)
    "start_time", #Column 2:  Starting time (UT) of the type II burst (hh:mm format)
    "end_date", #Column 3:  Ending date of the type II burst (mm/dd format; year in Column 1 applies)
    "end_time", #Column 4:  Ending time of the Type II burst (hh:mm  format)
    "start_frequency", #Column 5:  Starting frequency of type II burst (kHz) [1]
    "end_frequency", #Column 6:  Ending frequency of type II burst (kHz) [1]
    "flare_location", #Column 7:  Solar source location (Loc) of the associated eruption in heliographic coordinates [2]
    "flare_region",  #Column 8:  NOAA active region number (NOAA) [3]
    "flare_classification",  #Column 9:  Soft X-ray flare importance (Imp)  [4]
    "cme_date",   #Column 10: Date of the associated CME (mm/dd format, Year in Column 1 applies) [5]
    "cme_time",   #Column 11: Time of the associated CME (hh:mm format)
    "cme_angle",  #Column 12: Central position angle (CPA, degrees) for non-halo CMEs [6]
    "cme_width",  #Column 13: CME width in the sky plane (degrees) [7]
    "cme_speed"   #Column 14: CME speed in the sky plane (km/s)
]

nasa_df.columns = columns
nasa_df

__Condition [1]__

In [None]:
nasa_df["start_frequency"] = nasa_df["start_frequency"].apply(lambda x: None if '?' in x else x)

In [None]:
nasa_df["end_frequency"] = nasa_df["end_frequency"].apply(lambda x: None if '?' in x else x)

__Condition [3]__

In [None]:
nasa_df["flare_region"] = nasa_df["flare_region"].apply(lambda x: None if '--' in x else x)

In [None]:
nasa_df.head(6)

__Condition [4]__

In [None]:
#nasa_df["flare_classification"] = nasa_df["flare_classification"].apply(lambda x: None if '--' in x else x)

Only keep the rows whose flare classification contains `X`:

In [None]:
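# Use na=False to skip missing values, and copy() so that later column assignments modify an independent DataFrame.
flares_df = nasa_df.loc[nasa_df['flare_classification'].str.contains('X', na=False)].copy()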

In [None]:
flares_df

Get rid of X for sorting purposes

In [None]:
flares_df['flare_classification'] = flares_df['flare_classification'].str.lstrip('X')

Convert to float

In [None]:
flares_df['flare_classification'] = flares_df.flare_classification.astype(float)

Sort rows by flare class

In [None]:
flares_df = flares_df.sort_values('flare_classification', ascending = False)

In [None]:
flares_df

__Only consider the top 50__

In [None]:
flares_df = flares_df.head(50)

Reconvert back to string

In [None]:
flares_df['flare_classification'] = flares_df.flare_classification.astype(str)

Add the `X` before any classification value

In [None]:
flares_df['flare_classification'] = "X" + flares_df['flare_classification']
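
The reason for the round trip (strip the `X`, convert to float, sort, then restore the `X`) is that sorting the strings directly would compare them character by character. The sketch below shows the difference on made-up flare classes, using a sort key instead of modifying the values; the helper name `magnitude` is hypothetical.

```python
# Made-up flare classes for illustration only.
classes = ["X1.2", "X28.0", "X9.3", "X2.0"]


def magnitude(flare_class: str) -> float:
    """Return the number after the leading 'X' (e.g. 'X9.3' -> 9.3)."""
    return float(flare_class.lstrip("X"))


as_text = sorted(classes, reverse=True)                 # character-by-character order
by_value = sorted(classes, key=magnitude, reverse=True)  # numeric order
print(as_text)
print(by_value[:3])
```

In Pandas, a similar effect may be obtained by sorting on a temporary numeric column and leaving the original strings untouched.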

In [None]:
flares_df