In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


# Starmap

## Data Retrieval

My goal with this project was to create a star map that included the [88 modern constellations](https://en.wikipedia.org/wiki/88_modern_constellations). Each constellation is made up of dozens of individual stars, and each star has a name, [right ascension](https://en.wikipedia.org/wiki/Right_ascension), [declination](https://en.wikipedia.org/wiki/Declination), and [apparent magnitude](https://en.wikipedia.org/wiki/Apparent_magnitude).

I decided to scrape this information from Wikipedia and put it in a pandas `DataFrame` before processing it further.

In [2]:
import requests
from bs4 import BeautifulSoup

wiki = 'https://en.wikipedia.org/wiki/Lists_of_stars_by_constellation'
header = {'User-Agent': 'Mozilla/5.0'}

page = requests.get(wiki, headers=header).text
soup = BeautifulSoup(page, 'html.parser')

stars = []

for constellation in soup.table.find_all('a'):
    page = requests.get('https://en.wikipedia.org' + constellation.attrs['href'], headers=header).text
    starsoup = BeautifulSoup(page, 'html.parser')

    # lower-cased `title` attributes from the header row, used to locate columns by name
    titles = [x.attrs['title'].lower() for x in starsoup.table('tr')[0](title=True)]
    
    namecol = 0
    racol = titles.index('right ascension')
    deccol = titles.index('declination')
    amcol = titles.index('apparent magnitude')
            
    # skip the header row and the last three rows of the table
    for star in starsoup.table('tr')[1:-3]:
        row = star('td')

        try:
            name = row[namecol].a['title']
        except (TypeError, KeyError):
            name = ''
            
        right_ascension = row[racol].text
        declination = row[deccol].text
        apparent_magnitude = row[amcol].text
        parent_constellation = constellation.string

        stars.append([name, right_ascension, declination, apparent_magnitude, parent_constellation])
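
One fragile spot in the loop above is `titles.index(...)`, which raises `ValueError` if a page's header row lacks one of the expected titles and would stop the whole scrape. A minimal sketch of a more forgiving lookup follows. `find_columns` and the sample titles are hypothetical and not part of the original notebook.

```python
def find_columns(titles, wanted):
    """Map each wanted column title to its index in `titles`, or None if absent."""
    lowered = [t.lower() for t in titles]
    return {name: (lowered.index(name) if name in lowered else None)
            for name in wanted}


# Hypothetical header titles, similar in shape to what the scraper collects.
titles = ['Bayer designation', 'Right ascension', 'Declination', 'Apparent magnitude']
wanted = ['right ascension', 'declination', 'apparent magnitude']

columns = find_columns(titles, wanted)

# Any title mapped to None is missing, so the caller can skip that page
# instead of crashing.
missing = [name for name, idx in columns.items() if idx is None]
```

With a lookup like this, a page with an unexpected header could be logged and skipped rather than aborting the run.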

## Data Cleaning

Now that we have the data, let's put it in a `DataFrame` and clean it up. The cleanup handles:

1. missing values
1. stripping the ' (page does not exist)' suffix from names
1. footnotes
1. coordinate parsing
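
To make the footnote and sign handling concrete, here is a small standalone sketch using only the `re` module on a few made-up cell values. The helper name `clean_magnitude` and the sample strings are assumptions for illustration. The pattern is a variant of the one used in the notebook, grouped so that a leading sign is also kept for integer values.

```python
import re

# Optional sign, then either a decimal number or an integer.
MAGNITUDE = re.compile(r'[-+]?(?:\d*\.\d+|\d+)')


def clean_magnitude(raw):
    """Return the magnitude as a float, or None when no number is present."""
    # Replace the Unicode minus sign with ASCII '-' and drop the '~' marker.
    text = raw.replace('−', '-').replace('~', '')
    match = MAGNITUDE.search(text)
    return float(match.group()) if match else None


# Made-up raw values: a footnote marker, a Unicode minus, an approximation, a gap.
samples = ['2.07[1]', '−1.46', '~5.5', 'n/a']
cleaned = [clean_magnitude(s) for s in samples]
```

Taking only the first numeric match is what discards trailing footnote markers such as `[1]`.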

In [3]:
import numpy as np
import pandas as pd

starsdf = pd.DataFrame(stars, columns=['Name', 'Right Ascension', 'Declination', 'Apparent Magnitude', 'Constellation'])
   
# remove ' (page does not exist)' string from end of some names
# (raw string plus regex=True so the escaped parentheses are treated as a pattern)
starsdf.Name = starsdf.Name.str.replace(r' \(page does not exist\)', '', regex=True)

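# normalize magnitudes: ASCII minus sign, drop '~' and 'n/a',
# then keep only the first number, which also strips footnote markers
starsdf['Apparent Magnitude'] = starsdf['Apparent Magnitude'].str.replace('−', '-')
starsdf['Apparent Magnitude'] = starsdf['Apparent Magnitude'].str.replace('~', '')
starsdf['Apparent Magnitude'] = starsdf['Apparent Magnitude'].str.replace('n/a', '')
# the sign sits outside the alternation so it is kept for integer values too
starsdf['Apparent Magnitude'] = starsdf['Apparent Magnitude'].str.extract(r'([-+]?(?:\d*\.\d+|\d+))', expand=False)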

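# store the constellation name as a categorical
starsdf.Constellation = starsdf.Constellation.astype('category')

# normalize declination: ASCII minus sign, no space before the arcsecond mark
starsdf['Declination'] = starsdf['Declination'].str.replace('−', '-')
starsdf['Declination'] = starsdf['Declination'].str.replace(' ″', '″')

# remove the space before the seconds unit in right ascension
starsdf['Right Ascension'] = starsdf['Right Ascension'].str.replace(' s', 's')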

# set any row with an empty string in some column to NaN,
# so that dropna() below removes it
for key in starsdf.keys():
    starsdf.loc[starsdf[key] == ''] = np.nan
 
starsdf['Apparent Magnitude'] = starsdf['Apparent Magnitude'].astype(float)

starsdf.dropna(inplace=True)

starsdf

| | Name | Right Ascension | Declination | Apparent Magnitude | Constellation |
|---|---|---|---|---|---|
| 0 | Alpha Andromedae | 00h 08m 23.17s | +29° 05′ 27.0″ | 2.07 | Andromeda |
| 1 | Beta Andromedae | 01h 09m 43.80s | +35° 37′ 15.0″ | 2.07 | Andromeda |
| 2 | Gamma Andromedae | 02h 03m 53.92s | +42° 19′ 47.5″ | 2.10 | Andromeda |
| 3 | Delta Andromedae | 00h 39m 19.60s | +30° 51′ 40.4″ | 3.27 | Andromeda |
| 4 | Andromeda Galaxy | 00h 42m 44.31s | +41° 16′ 09.4″ | 3.44 | Andromeda |
| 5 | 51 Andromedae | 01h 37m 59.50s | +48° 37′ 42.6″ | 3.59 | Andromeda |
| 6 | Omicron Andromedae | 23h 01m 55.25s | +42° 19′ 33.5″ | 3.62 | Andromeda |
| 7 | Lambda Andromedae | 23h 37m 33.71s | +46° 27′ 33.0″ | 3.81 | Andromeda |
| 8 | Mu Andromedae | 00h 56m 45.10s | +38° 29′ 57.3″ | 3.86 | Andromeda |
| 9 | Zeta Andromedae | 00h 47m 20.39s | +24° 16′ 02.6″ | 4.08 | Andromeda |
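
The right ascension and declination values in this output are still strings. Plotting them would need numeric coordinates, so here is a minimal sketch of one way to convert them to decimal degrees. It assumes the cleaned formats shown above (`00h 08m 23.17s` and `+29° 05′ 27.0″`). The function names `ra_to_degrees` and `dec_to_degrees` are hypothetical and not part of the original notebook.

```python
import re

# Patterns for the cleaned coordinate strings.
RA_PATTERN = re.compile(r'(\d+)h\s*(\d+)m\s*([\d.]+)s')
DEC_PATTERN = re.compile(r'([+-]?)(\d+)°\s*(\d+)′\s*([\d.]+)″')


def ra_to_degrees(text):
    """Convert 'HHh MMm SS.SSs' to decimal degrees (1 hour = 15 degrees)."""
    match = RA_PATTERN.search(text)
    if match is None:
        return None
    hours, minutes, seconds = (float(g) for g in match.groups())
    return (hours + minutes / 60 + seconds / 3600) * 15


def dec_to_degrees(text):
    """Convert '+DD° MM′ SS.S″' to signed decimal degrees."""
    match = DEC_PATTERN.search(text)
    if match is None:
        return None
    # The sign applies to the whole angle, not only to the degrees part.
    sign = -1.0 if match.group(1) == '-' else 1.0
    degrees, minutes, seconds = (float(g) for g in match.groups()[1:])
    return sign * (degrees + minutes / 60 + seconds / 3600)


# Alpha Andromedae, taken from the first row of the table.
ra = ra_to_degrees('00h 08m 23.17s')
dec = dec_to_degrees('+29° 05′ 27.0″')
```

Applied to the `DataFrame`, this could look like `starsdf['Right Ascension'].map(ra_to_degrees)`, with the result stored in a new column so the original strings stay available.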


In [5]:
starsdf.to_csv('constellations.csv', index=False)