# 05 Obtaining and Cleaning Data
__Math 3080: Fundamentals of Data Science__

Reading:
* McKinney, Chapter 6 - Data Loading, Storage, and File Formats
* Grus, Chapter 9 - Getting Data
* Geron, Chapter 2 - End-to-End Machine Learning Project, pp. 46-51, 62-68

Outline:
1. Obtaining the data
    * Downloading the data directly
    * HTML Scraping
    * APIs, JSON, and XML
    * Filtering the Data
2. Loading the data
    * Loading in NumPy
    * Loading in Pandas
3. Cleaning the Data

-----
We have learned about the calculations needed for basic models. Linear algebra is used frequently to handle data, as well as to carry out the mathematics behind the statistical models we use. We will return to linear regression later in the course, along with logistic regression and decision trees.

* Bring up the Question/Data circle: 
  * Questions --> Datasets --> Data Types --> Input/Calculations/Output
* We have talked about some of the calculations, and there will be a lot more
* An overarching concern in the entire process is __Data Wrangling__ (Draw in middle of data circle). Data Wrangling consists of
    1. Obtaining the data
    2. Cleaning the data
    3. Manipulating the data (not changing numbers, but arranging the data in useful formats)
    4. Visualization and Analysis of the data (Leads into the calculation part of the cycle)

In this segment, we will look at how to obtain and clean the data. The next two segments will look at how to manipulate and visualize the data.

## 5.1 Obtaining the data
Where can we get data?

### Online websites (Kaggle, Data Centers)
Data is stored all over the web. Most websites that deal with data will have a way to download the data. For example:
  * kaggle.com

Sometimes, the data is available to be displayed, but you have to copy it, put it into Excel or a text editor, and save it in a format that can be loaded. For example:
  *  https://www.weather.gov/wrh/timeseries?site=K41U

This works just fine, but every time you need updated data, you have to capture the data, put it into Excel, save it in the right format, and then load it into the computer. It would be very helpful if we could get the data automatically. There are a couple of good ways to do this:
  * HTML Scraping: code that goes through an HTML file and grabs the displayed data (done in Data Mining - 2nd semester)
  * APIs (Application Programming Interfaces): some data is available online, and can be loaded directly into the program
    * INQUIRE TO SEE WHO HAS ALREADY WORKED WITH APIs
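
As a rough illustration of what HTML scraping involves, the sketch below uses Python's built-in `html.parser` module to pull the cells out of a small table. The HTML fragment and the `TableScraper` class are made up for this example; real pages are usually messier, and dedicated scraping libraries are often used instead.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded HTML file.
SAMPLE_HTML = """
<table>
  <tr><th>Time</th><th>Temp (F)</th></tr>
  <tr><td>08:00</td><td>41</td></tr>
  <tr><td>09:00</td><td>45</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._in_cell = False   # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *records = scraper.rows
print(header)
print(records)
```

The first row holds the column labels and the remaining rows hold the data, so the result could be handed to `pd.DataFrame(records, columns=header)`.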

### APIs



In [7]:
import pandas as pd
import requests

In [16]:
api_url = "https://api.yelp.com/v3/businesses/search"

authorization = {'Authorization': 'Bearer ***************'}
# The authorization key is unique to the user. When you register for an account, you will be given a key.
# You shouldn't ever share your key, as others could change data with your key, and you will get the blame.

search_parameters = {
    'term': 'restaurants',
    'location': 'Ephraim, UT',
    'limit' : 30
}

response = requests.get(api_url, headers=authorization, params=search_parameters)

#response.text # Returns the text (JSON) from the request
data = response.json() # Translates the JSON file to a dictionary
data

{'businesses': [{'id': 'MdpvieSSi2Z4pREUlYnl7w',
   'alias': 'solid-rock-cafe-ephraim',
   'name': 'Solid Rock Cafe',
   'image_url': 'https://s3-media2.fl.yelpcdn.com/bphoto/t18VpatgR6oKNORma1BRCg/o.jpg',
   'is_closed': False,
   'url': 'https://www.yelp.com/biz/solid-rock-cafe-ephraim?adjust_creative=rAfNf1xUCz-EbUgzFpdQJg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=rAfNf1xUCz-EbUgzFpdQJg',
   'review_count': 15,
   'categories': [{'alias': 'cafes', 'title': 'Cafes'},
    {'alias': 'coffee', 'title': 'Coffee & Tea'},
    {'alias': 'bagels', 'title': 'Bagels'}],
   'rating': 4.5,
   'coordinates': {'latitude': 39.3597646, 'longitude': -111.5842886},
   'transactions': [],
   'price': '$',
   'location': {'address1': '96 E Center St',
    'address2': '',
    'address3': '',
    'city': 'Ephraim',
    'zip_code': '84627',
    'country': 'US',
    'state': 'UT',
    'display_address': ['96 E Center St', 'Ephraim, UT 84627']},
   'phone': '+14352830178',
   'dis

In [17]:
# Let's list the different entries in our dictionary
list(data)

['businesses', 'total', 'region']

In [18]:
data['total']

17

In [19]:
data['region']

{'center': {'longitude': -111.58058166503906, 'latitude': 39.35818120287217}}

In [20]:
# Finally, let's convert our data in the 'businesses' entry into a DataFrame:
restaurants = pd.DataFrame(data['businesses'])
restaurants

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
0,MdpvieSSi2Z4pREUlYnl7w,solid-rock-cafe-ephraim,Solid Rock Cafe,https://s3-media2.fl.yelpcdn.com/bphoto/t18Vpa...,False,https://www.yelp.com/biz/solid-rock-cafe-ephra...,15,"[{'alias': 'cafes', 'title': 'Cafes'}, {'alias...",4.5,"{'latitude': 39.3597646, 'longitude': -111.584...",[],$,"{'address1': '96 E Center St', 'address2': '',...",14352830178,(435) 283-0178,364.101495
1,WIWFHMIm4Ge_vf5_6XlwJw,abundance-ephraim,Abundance,https://s3-media1.fl.yelpcdn.com/bphoto/Y80DJ8...,False,https://www.yelp.com/biz/abundance-ephraim?adj...,37,"[{'alias': 'sandwiches', 'title': 'Sandwiches'...",4.5,"{'latitude': 39.3603820099702, 'longitude': -1...",[],$$,"{'address1': '27 N Main St', 'address2': '', '...",14352834734,(435) 283-4734,638.624156
2,dqkDgJWi9dkuWRcsnZTqIw,kalamas-island-style-ephraim,kalama's Island Style,https://s3-media2.fl.yelpcdn.com/bphoto/MoetKy...,False,https://www.yelp.com/biz/kalamas-island-style-...,7,"[{'alias': 'hawaiian', 'title': 'Hawaiian'}]",4.5,"{'latitude': 39.35869, 'longitude': -111.58688}",[],,"{'address1': '61 S Main St', 'address2': '', '...",14352833577,(435) 283-3577,551.054826
3,xmT_KLJ2Fh9o7ozWF-s4lQ,the-malt-shop-ephraim-2,The Malt Shop,https://s3-media1.fl.yelpcdn.com/bphoto/cpbfea...,False,https://www.yelp.com/biz/the-malt-shop-ephraim...,36,"[{'alias': 'icecream', 'title': 'Ice Cream & F...",3.5,"{'latitude': 39.36257, 'longitude': -111.5869}",[],$$,"{'address1': '150 N Main St', 'address2': '', ...",14352834101,(435) 283-4101,717.39961
4,OfPUpvh77irKr2pfEYVEWg,malenas-cafe-mexican-food-ephraim,Malena's Cafe Mexican Food,https://s3-media3.fl.yelpcdn.com/bphoto/MH48Mo...,False,https://www.yelp.com/biz/malenas-cafe-mexican-...,63,"[{'alias': 'mexican', 'title': 'Mexican'}]",4.0,"{'latitude': 39.36544, 'longitude': -111.587203}",[],$,"{'address1': '295 N Main St', 'address2': '', ...",14352834425,(435) 283-4425,987.680631
5,qIwElGbCtu5-Cv-20GQDAA,roys-pizza-and-pasta-ephraim,Roy's Pizza & Pasta,https://s3-media2.fl.yelpcdn.com/bphoto/CD90xe...,False,https://www.yelp.com/biz/roys-pizza-and-pasta-...,48,"[{'alias': 'italian', 'title': 'Italian'}, {'a...",4.0,"{'latitude': 39.35841, 'longitude': -111.5869}",[],$$,"{'address1': '81 S Main St', 'address2': '', '...",14352834222,(435) 283-4222,530.37032
6,UOInSUbHc0Lg6IblQSiSZw,snow-dragon-chinese-ephraim,Snow Dragon Chinese,https://s3-media3.fl.yelpcdn.com/bphoto/C4rWuV...,False,https://www.yelp.com/biz/snow-dragon-chinese-e...,41,"[{'alias': 'chinese', 'title': 'Chinese'}]",3.5,"{'latitude': 39.3524651, 'longitude': -111.587...",[],$$,"{'address1': '413 S Main St', 'address2': '', ...",14352836868,(435) 283-6868,843.867902
7,AhJ6yH_JjlFQ50SGqi0ngA,snocap-lanes-ephraim,SnoCap Lanes,https://s3-media1.fl.yelpcdn.com/bphoto/b0VOdr...,False,https://www.yelp.com/biz/snocap-lanes-ephraim?...,11,"[{'alias': 'bowling', 'title': 'Bowling'}, {'a...",3.5,"{'latitude': 39.3505226758956, 'longitude': -1...",[],$,"{'address1': '605 S Main St', 'address2': '', ...",14352834522,(435) 283-4522,1029.197981
8,jBgH6DkVZXkSoRMBp6negw,los-amigos-mexican-restaurant-ephraim,Los Amigos Mexican Restaurant,https://s3-media4.fl.yelpcdn.com/bphoto/VGddR0...,False,https://www.yelp.com/biz/los-amigos-mexican-re...,22,"[{'alias': 'mexican', 'title': 'Mexican'}]",3.0,"{'latitude': 39.347626, 'longitude': -111.58785}",[],$$,"{'address1': '3 E 700th S', 'address2': '', 'a...",14352835675,(435) 283-5675,1329.695605
9,sXfGLUEbdUL8tW8dcYfbgg,sip-it-soda-shack-ephraim,Sip It ! Soda Shack,https://s3-media2.fl.yelpcdn.com/bphoto/v6Vvpa...,False,https://www.yelp.com/biz/sip-it-soda-shack-eph...,5,"[{'alias': 'foodstands', 'title': 'Food Stands'}]",4.0,"{'latitude': 39.3517764076663, 'longitude': -1...",[],$,"{'address1': '455 S Main St', 'address2': '', ...",14352833435,(435) 283-3435,894.196264


And now, the data is clean and orderly, ready for us to use in our data analysis.
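
Some columns in the table above, such as `coordinates` and `location`, still hold whole dictionaries in each cell. One possible way to spread them into separate columns is `pd.json_normalize`. The sketch below uses two made-up records shaped like the `businesses` entries (the names and values are placeholders, not Yelp data).

```python
import pandas as pd

# Hypothetical records shaped like the 'businesses' entries (values are placeholders).
businesses = [
    {"name": "Cafe A", "rating": 4.5,
     "coordinates": {"latitude": 39.36, "longitude": -111.58},
     "location": {"city": "Ephraim", "state": "UT"}},
    {"name": "Diner B", "rating": 3.5,
     "coordinates": {"latitude": 39.35, "longitude": -111.59},
     "location": {"city": "Ephraim", "state": "UT"}},
]

# json_normalize expands nested dictionaries into dotted column names,
# e.g. 'coordinates.latitude' and 'location.city'.
flat = pd.json_normalize(businesses)
print(flat.columns.tolist())
print(flat[["name", "coordinates.latitude", "location.city"]])
```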

At this point, we are going to leave APIs. We will dig deeper into APIs in our Math 3280 Data Mining class.

## 5.2 Loading the data

In [None]:
# Loading files for reading using NumPy
import numpy as np

matrix = np.array([[1,2,3],
                   [2,3,4],
                   [3,4,5]])

np.save('data/matrix.npy', matrix)

In [None]:
load_file = np.load('data/matrix.npy')
load_file

In [None]:
np.loadtxt('data/test.txt', delimiter=' ')
# Other common delimiters are ',' (comma) and '\t' (tab).
# Be sure to look at the documentation for the other options available.
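
To try the different delimiters without creating files, the following sketch feeds small made-up text snippets to `np.loadtxt` through `io.StringIO`, which behaves like an open file. The three strings are assumptions for illustration only.

```python
import io
import numpy as np

# Hypothetical file contents; io.StringIO lets loadtxt read them like a file.
space_text = "1 2 3\n4 5 6\n"
comma_text = "1,2,3\n4,5,6\n"
tab_text = "1\t2\t3\n4\t5\t6\n"

a = np.loadtxt(io.StringIO(space_text), delimiter=" ")
b = np.loadtxt(io.StringIO(comma_text), delimiter=",")
c = np.loadtxt(io.StringIO(tab_text), delimiter="\t")

print(a)
# All three layouts describe the same matrix once the right delimiter is given.
print(np.array_equal(a, b) and np.array_equal(b, c))
```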

In [2]:
# Loading files using pandas
import pandas as pd

df = pd.read_csv('data/test.txt', delimiter=' ')
df

Unnamed: 0,1,2,3,4,5
0,2,3,4,5,6
1,3,4,5,6,7
2,4,5,6,7,8


In [3]:
df = pd.read_csv('data/test.txt', delimiter=' ', header=None)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,5
1,2,3,4,5,6
2,3,4,5,6,7
3,4,5,6,7,8


In [None]:
df = pd.read_csv('data/test.txt', delimiter=' ', header=None)
# Use a single list of labels; a nested list would create a MultiIndex
df.columns = ['HW 1', 'HW 2', 'HW 3', 'Quiz 1', 'Exam 1']
df.index = ['001', '002', '003', '004']
df

In [None]:
df['HW 2']

In [None]:
df.loc['003']

In [None]:
df.iloc[2]

In [None]:
df.loc['003','HW 2']

In [None]:
df['HW 3'] >= 4

In [None]:
df[df['HW 3'] == 4]
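
The selection commands above can be tried in one self-contained sketch. It assumes `data/test.txt` contains the four rows shown in the earlier output and reads them from a string instead of a file. The name `grades` is chosen here only for illustration.

```python
import io
import pandas as pd

# Assumed contents of data/test.txt, matching the output shown earlier.
text = "1 2 3 4 5\n2 3 4 5 6\n3 4 5 6 7\n4 5 6 7 8\n"

grades = pd.read_csv(io.StringIO(text), delimiter=" ", header=None)
grades.columns = ["HW 1", "HW 2", "HW 3", "Quiz 1", "Exam 1"]
grades.index = ["001", "002", "003", "004"]

by_label = grades.loc["003"]            # row selected by its index label
by_position = grades.iloc[2]            # the same row selected by integer position
one_value = grades.loc["003", "HW 2"]   # a single cell: row label, then column label
mask = grades["HW 3"] >= 4              # boolean Series, one True/False per row
filtered = grades[mask]                 # keeps only the rows where mask is True

print(one_value)
print(filtered)
```

`.loc` always works with labels and `.iloc` always works with positions, so the two agree here only because the label `'003'` happens to sit in position 2.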

## 5.3 Cleaning the data
If we are lucky, datasets are perfect and ready to use. Most of the time, however, there are problems:
* Unlabeled data
* Missing data
* Unorganized



#### 5.3.1 Unlabeled data
Consider the following dataset:

In [1]:
import pandas as pd

df = pd.read_csv('../Datasets/unlabeled_data.csv', header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,71,72,73,74,75,76,77,78,79,80
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


All data should have 2 things: Labeled Columns and Documentation. Now, consider the original dataset:

In [2]:
df = pd.read_csv('../Datasets/Housing_Data/train.csv')
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


* Open the file ./math3080/Datasets/Housing_Data/data_description.txt

Every dataset should come with labeled columns and with another file that provides more information for each variable.
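
If a file arrives without labels but the documentation tells us what each column means, the labels can be attached while loading. A minimal sketch, assuming a made-up three-column file and using three of the housing column names from above:

```python
import io
import pandas as pd

# Hypothetical unlabeled file: three columns and no header row.
raw = "1,60,8450\n2,20,9600\n3,60,11250\n"

# Column names borrowed from the labeled housing data shown above.
labels = ["Id", "MSSubClass", "LotArea"]

# header=None says the first line is data; names= supplies the labels.
homes = pd.read_csv(io.StringIO(raw), header=None, names=labels)
print(homes)
```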

#### 5.3.2 Missing data
Another potential problem is that data could be unorganized. Sometimes data is just a list. Sometimes values are not in the right place. Sometimes they have the wrong units. So before anything else can be done, we have to organize the data.

How do we determine if data is missing? Missing values can take any of the following forms:
* An extreme number (-9999)
* NaN (Not a Number)
  * In Python, a NaN is produced using `np.nan`
* Blank entries (no information) - programs usually fill these with NaN

Let's look at the following dataset on passengers of Titanic:

In [4]:
titanic = pd.read_csv('../Datasets/Titanic/titanic_train.csv')
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


There are definitely some missing values in the __Age__ and __Cabin__ columns. Are there any others?

We can test for missing values using the `.isnull()` command. The result will be a DataFrame of True/False values indicating whether each value is missing (True) or not (False).

In [5]:
titanic.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


By taking the sum of this `.isnull()` DataFrame, we can find out exactly how many elements are missing in each column.

In [6]:
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
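
Counts are easier to judge as a share of the column. Since the mean of True/False values is the fraction that are True, `.isnull().mean()` gives that share directly. A small sketch on a made-up table (the `people` DataFrame is hypothetical, not the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with the same kinds of gaps as the Titanic data.
people = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.93, 53.10],
})

# Fraction of missing entries in each column.
missing_share = people.isnull().mean()
print(missing_share)
```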

We can't just leave these missing values in place, because they will affect our results. Here are four ways we can fill in the data, followed by the option of removing it:
* Filling missing values with column/row average
  * If there are few missing values, and all the other values in the column/row are random
  * Using row averages will be rare since rows are generally composed of different variables
* Filling missing values with previous/next value
  * If there are few missing values, and there is an order to the values in that variable
* Filling missing values with avg of previous and next values
  * If there are few missing values, and there is an order to the values in that variable
* Filling missing values with an interpolation of values
  * Looks at the pattern and fills in a value that fits that pattern
* Removing NaN rows and columns
  * If it doesn't make sense to fill them
  * If that column is needed but missing values will cause problems
  * If there are too many missing values
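
The ordered-data options in the list above can be compared side by side. The sketch below is only an illustration on a made-up Series. `ffill` and `bfill` are the pandas methods for carrying the previous or next value into a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered measurements with two gaps.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

previous = s.ffill()                      # copy the previous value forward
following = s.bfill()                     # copy the next value backward
neighbors = (previous + following) / 2    # average of previous and next values
linear = s.interpolate(method="linear")   # straight line between known values

print(pd.DataFrame({"original": s, "previous": previous, "next": following,
                    "neighbor_avg": neighbors, "linear": linear}))
```

For isolated gaps like these, the neighbor average and linear interpolation agree. They differ when several values in a row are missing.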

Following are commands to fill missing values:

In [None]:
# Fills all missing values with a 0
titanic.fillna(value=0) 

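# Fills all missing values with the value stored in "the_mean"
the_mean = titanic['Age'].mean()
titanic.fillna(value=the_mean) 

# Fills all missing values in the "Age" column with the column mean
# ("Embarked" holds text, so it has no mean; its most common value, .mode()[0], could be used instead)
titanic['Age'].fillna(value=titanic['Age'].mean()) 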

# Fills all missing values with an interpolation of the variable
titanic['Age'].interpolate(method='linear')

Following are commands to drop missing values:

In [None]:
# Drop rows with NaN values
titanic.dropna(axis=0) 

# Drop columns with NaN values
titanic.dropna(axis=1) 

# Keep only columns with at least 10 non-NaN values (drop the rest)
titanic.dropna(axis=1, thresh=10) 
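
Because `thresh` counts the values that are present rather than the values that are missing, it is easy to misread. A small sketch on a made-up table (column names `a`, `b`, `c` are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical table: column 'a' is complete, 'b' has 1 gap, 'c' has 3 gaps.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

no_gaps = df.dropna(axis=1)                # keeps only columns with no NaN at all
at_least_3 = df.dropna(axis=1, thresh=3)   # keeps columns with >= 3 non-NaN values

print(list(no_gaps.columns))
print(list(at_least_3.columns))
```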

To see how this works, let's work with a smaller dataset:

In [57]:
import numpy as np
import pandas as pd

missing_data = np.array([[1,1,2,3,5,1],
                         [1,3,np.nan,7,9,2],
                         [2,4,np.nan,8,0,3],
                         [9,np.nan,np.nan,6,5,4],
                         [4,3,2,1,np.nan,np.nan],
                         [3,7,np.nan,6,2,6]])

missing = pd.DataFrame(missing_data)
missing.columns=['Variable1','Variable2','Variable3','Variable4','Variable5','Variable6']
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,3.0,5.0,1.0
1,1.0,3.0,,7.0,9.0,2.0
2,2.0,4.0,,8.0,0.0,3.0
3,9.0,,,6.0,5.0,4.0
4,4.0,3.0,2.0,1.0,,
5,3.0,7.0,,6.0,2.0,6.0


First, let's fill the data in the Variable5 column with the column mean.

In [58]:
missing['Variable5'].fillna(value=missing['Variable5'].mean())

0    5.0
1    9.0
2    0.0
3    5.0
4    4.2
5    2.0
Name: Variable5, dtype: float64

Notice that this isn't a permanent change:

In [59]:
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,3.0,5.0,1.0
1,1.0,3.0,,7.0,9.0,2.0
2,2.0,4.0,,8.0,0.0,3.0
3,9.0,,,6.0,5.0,4.0
4,4.0,3.0,2.0,1.0,,
5,3.0,7.0,,6.0,2.0,6.0


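To make it permanent, assign the result back to the column. (Older code often passes `inplace=True` instead, but recent versions of pandas warn against calling it on a single column like this.)

In [60]:
missing['Variable5'] = missing['Variable5'].fillna(value=missing['Variable5'].mean())
missing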

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,3.0,5.0,1.0
1,1.0,3.0,,7.0,9.0,2.0
2,2.0,4.0,,8.0,0.0,3.0
3,9.0,,,6.0,5.0,4.0
4,4.0,3.0,2.0,1.0,4.2,
5,3.0,7.0,,6.0,2.0,6.0


But if there's an obvious pattern (as in time series data such as stock values), we may want the filled data to match the pattern. This is the case with Variable6:

In [61]:
missing['Variable6'] = missing['Variable6'].interpolate(method='linear')
missing

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,3.0,5.0,1.0
1,1.0,3.0,,7.0,9.0,2.0
2,2.0,4.0,,8.0,0.0,3.0
3,9.0,,,6.0,5.0,4.0
4,4.0,3.0,2.0,1.0,4.2,5.0
5,3.0,7.0,,6.0,2.0,6.0


Now, sometimes it is just easier to drop observations with missing data. This may be the case for Variable2:

In [62]:
missing.dropna(axis=0)

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,3.0,5.0,1.0
4,4.0,3.0,2.0,1.0,4.2,5.0


But this dropped every observation that had any NaN at all. We want to keep the observations that have valid numbers for Variable2, so we can specify that only rows with a missing value in Variable2 should be dropped.

In [64]:
missing.dropna(axis=0, subset=['Variable2'])

Unnamed: 0,Variable1,Variable2,Variable3,Variable4,Variable5,Variable6
0,1.0,1.0,2.0,3.0,5.0,1.0
1,1.0,3.0,,7.0,9.0,2.0
2,2.0,4.0,,8.0,0.0,3.0
4,4.0,3.0,2.0,1.0,4.2,5.0
5,3.0,7.0,,6.0,2.0,6.0


Let's also look at Variable3. Even though an average might make sense for filling in the values, there are so many missing values that this column is pretty much useless. So, let's just remove it.

In [65]:
missing.drop('Variable3', axis=1)

Unnamed: 0,Variable1,Variable2,Variable4,Variable5,Variable6
0,1.0,1.0,3.0,5.0,1.0
1,1.0,3.0,7.0,9.0,2.0
2,2.0,4.0,8.0,0.0,3.0
3,9.0,,6.0,5.0,4.0
4,4.0,3.0,1.0,4.2,5.0
5,3.0,7.0,6.0,2.0,6.0


Put them together and make it permanent:

In [66]:
missing = missing.dropna(axis=0, subset=['Variable2']).drop('Variable3', axis=1)
missing

-----
Let's take a break. I'd love some feedback for the course so far:
* [https://drive.google.com/file/d/1wMq_eWU3jZKyALEQoRMonhZmW70n9QN5/view?usp=sharing](https://drive.google.com/file/d/1wMq_eWU3jZKyALEQoRMonhZmW70n9QN5/view?usp=sharing)

-----