# <font color='gray'> Lab 1: Basic Data Exploration & Visualisation: Getting to Know Your Data </font>

## 0. Introduction 

The purpose of this introductory lab is to familiarise you with **basic exploratory data analysis** of a sample dataset. We will first use Python's **pandas** package in this **notebook** environment. Next, we will use ***Weka***, a Java-based data-mining application with a GUI.

*This session <ins>is a warm-up that will not count toward your final grade</ins>.*



**Before we start:** this environment, which lets us write text and run code interactively, is called a ***[notebook](https://jupyter.org/)***.

There are two types of cells: *Text* and *Code*. You can add your own cells. You can also edit a text cell by double-clicking on it. Text cells follow the [Markdown rules](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet).

In order to execute (run) a cell, you can use one of the following ways:

0. `Shift + Enter` : executes a cell and goes to the next one.
1. `Ctrl + Enter` : executes a cell but stays on the same cell. This is equivalent to clicking on the *run* button to the left of the cell, which appears when you hover the mouse over the `[ ]` icon.
2. Use the `Runtime` tab (at the top of the page), which gives you more options as well.

Now, test your knowledge(!) by running the next cell (the output should appear right after):

In [0]:
for i in range(5):
  print('{:2d} - Welcome to the Data Mining lab!'.format(2**i))

As you might have noticed, we will be using Python code. Let's have a word on why `Python`.

*Well, for one, it is the most preferred language among data scientists, according to [this poll](https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html).
Also, it helps that it is a general-purpose programming language with a simple syntax. It also has a lot of great open-source libraries (called `packages` in Python) developed by an active community. If interested, [read more](https://www.cbtnuggets.com/blog/technology/data/why-data-scientists-love-python) on why Python is so popular among data scientists.*

#### Some more `notebook` cool tricks:


- You can use the **`tab`** button for auto-completing. You can also use `tab` after a **`dot`** to be shown a drop-down list of the available attributes and methods on an object or a class. For instance, let's create a string object called `mystring` (by executing the first cell), and use the `tab` key after a `dot` to see a list of available attributes and methods for a string object in Python (in the second cell: don't run it, just use the tab key after dot!).

In [0]:
mystring = '  Hello, World!'

In [0]:
mystring.

> Note that in Python, every variable refers to an object.
>Also, by now you should have noticed that we don't need to explicitly declare variables and their types in Python before using them!

> Another thing to keep in mind is that objects and functions (runtime variables) persist between different cells in the same notebook session. If you want to clear the memory, you can choose "`Reset all runtimes...`" under the "`Runtime`" tab in the top-left menu.
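
The notes above about objects and undeclared types can be checked directly. The following is a minimal sketch; the variable names are made up for illustration and do not appear elsewhere in this lab.

```python
# Illustrative names only; no type declarations are needed.
answer = 42
greeting = '  Hello, World!'

# Every value is an object with a type and with methods.
print(type(answer).__name__)
print(type(greeting).__name__)
print(answer.bit_length())      # integers have methods too

# The same name can later be bound to an object of a different type.
answer = 'forty-two'
print(type(answer).__name__)
```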


- Another cool point about the `notebook` environment is that if you want to get help on anything (a method, a function, an object, etc.), you can just put a question mark right after it and execute that line. For instance, let's get help on what the `strip` method does on a string object, by executing the following cell:

In [0]:
mystring.strip?
# This opens the help on the method "strip" at the bottom of this
# page. You can close the help page after reading it.
# By the way, this is how we designate a comment line in Python!

From now on, we may give a hint about answering a question by putting a question mark after a command, which opens its help page. Your task is to complete the command.

In [0]:
print(mystring)
print(mystring.strip())
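
The `?` suffix is a notebook (IPython) convenience. As a small sketch of what also works in plain Python, the built-in `help()` function and the `__doc__` attribute expose the same documentation:

```python
mystring = '  Hello, World!'

# help() prints the documentation of any object, function, or method.
help(mystring.strip)

# The raw docstring is also available as an attribute.
print(str.strip.__doc__)

# With no arguments, strip() removes leading and trailing whitespace.
print(repr(mystring.strip()))
```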

## 1.   Explore the Dataset -- using Python's `Pandas`

Here, we are going to perform some basic *exploratory data analysis* on some sample data. The data is about cars for sale and was collected in summer 2014 from the website _Autotrader.com_ by one of our MSc students who worked on a data mining project. The data is saved in a CSV file called **`LondonCars.csv`**.

---
> **Q0:** Why are such files called `CSV`? (Find out what CSV stands for. It may also help to try and open the file using a basic text editor).


> **A0:** (*you can use this space to write your answer for your own note taking! Just double-click on this cell*)
---

#### 0.   Upload the CSV file to this machine.



>-   On the left side of this page, click on the arrow to "Open the left pane".
>-   From the "Files" tab, click on `UPLOAD`.
>-   Choose the CSV file that you downloaded from QM+ to your local machine, to be uploaded to *this machine*.

>_Note:_ *By "this machine", we mean the virtual machine that is allocated to your account and running this notebook, hosted on Google Compute Engine. Note that the next time you log in, the file may be gone and need to be re-uploaded, as Google recycles virtual machines when they are idle for a while.*



#### 1.   Load the dataset from the CSV file:



We will use a Python package called [pandas](https://pandas.pydata.org/): it has many useful features for working with structured data, and is popular for its ease of use (and, like many Python packages, it is [open-source](https://github.com/pandas-dev/pandas)).


So, let's start by reading the CSV file into a pandas `DataFrame`:

(Note that we use **`import`** to (guess what) import a package in Python! We can also assign a different name (alias) to it, usually a shorthand name for convenience. For example, we typically `import pandas as pd`, because we are lazy!)

Execute the following code block. 



 


In [0]:
import pandas as pd
df = pd.read_csv('./LondonCars2014.csv')

Let's get more information about the `read_csv` function that we used by executing the following cell:

In [0]:
pd.read_csv?

---
> **Q1:** You should notice that the function has many options. When an option is not specified, its default value is used. Find out what the default values were for our `read_csv` call. In particular, find out what the options `sep`, `header`, `index_col`, `usecols`, `dtype`, `na_values`, and `encoding` do.

> **A1:** 
---
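
To experiment with some of these options without touching the cars file, here is a minimal sketch that reads a tiny in-memory table. The data, the column names, and the `unknown` marker are all made up for illustration; only the `read_csv` parameters themselves come from pandas.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, invented for this illustration.
raw = io.StringIO(
    "make;year;price;notes\n"
    "Honda;2010;5000;clean\n"
    "Ford;2012;unknown;dented\n"
)

cars = pd.read_csv(
    raw,
    sep=';',                            # field delimiter (a comma by default)
    header=0,                           # row number that holds the column names
    usecols=['make', 'year', 'price'],  # keep only these columns
    dtype={'year': 'int32'},            # force a type instead of inferring it
    na_values=['unknown'],              # extra strings to treat as missing
)
print(cars)
print(cars.dtypes)
```

Try removing one option at a time and compare the printed table and dtypes.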



#### 2.   Get general information about the data:



So far, we have read the CSV file into a variable we called **`df`**: it is a pandas `DataFrame` object that contains the information in the CSV file, along with many useful attributes and methods. For instance, let's print the first few entries along with the column names, to get a quick feeling for the data:


In [0]:
df.head()

We still don't know whether the data types were read correctly, since we didn't specify them. The `info` method gives us a summary:

In [0]:
df.info?

You should notice that the data types are not exactly correct. For instance, the type of the `Doors` attribute (column) is inferred as numeric (64-bit integer), but this is wrong (why?).

So let's fix them! 

In [0]:
df = df.astype({'Make':'category', 'Model':'category', 'Year':'category', 
                'Mileage':'int32', 'Price':'int64', 'Body Style':'category', 
                'Ex Color':'category' , 'In Color':'category', 
                'Engine':'category', 'Transmission':'category', 'Doors':'category'})

In [0]:
# Now, use the `info` method again to check whether it had the desired effect:

---
> **Q2:** You should also notice a change in the reported `memory usage` (the last line reported by the `info` method). What is the change? How do you explain it?

> **A2:** 
---
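
To experiment with how a column's type relates to its memory footprint outside the cars data, here is a small sketch on a made-up column of repeated labels. The values and variable names are hypothetical; `memory_usage(deep=True)` reports the bytes used by a `Series`, including the strings it holds.

```python
import pandas as pd

# Hypothetical column: a few distinct labels repeated many times.
colours = pd.Series(['Black', 'White', 'Silver', 'Red'] * 2500)

as_strings = colours.memory_usage(deep=True)
as_category = colours.astype('category').memory_usage(deep=True)

print('stored as strings   : {} bytes'.format(as_strings))
print('stored as a category: {} bytes'.format(as_category))
```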




### 3. Basic exploratory questions using Python



---
> **Q3:** Use the provided hints to answer the questions about exploring your dataset.

> a. How many instances does the dataset have?

In [0]:
# hint: interpret the output of the following code:
print(df.shape)

> b. How many attributes?

In [0]:
# again, you can answer the question by interpreting the output of the same code:
print(df.shape)

> c. What are the attributes?

In [0]:
# a pandas DataFrame has an attribute called columns:
print(df.columns)

> d. What are the possible values for **Body Style** & **External Color**?

In [0]:
# the "unique" method helps: 
print('Possible body styles:')
print(df['Body Style'].unique())

# now your turn, for external colour:
print('\nPossible external colours:')
print('?')

> e. What is the *minimum*, *maximum*, *average* and *median* price?

In [0]:
# these are easy, to get the minimum:
print('min = {}'.format(df['Price'].min()))
# now, you do the rest:
print('max = {}'.format('?'))
print('mean = {}'.format('?'))
print('median = {}'.format('?'))


> f. Why might the median price be different from the average price?

> g. What is the most common year of car?

In [0]:
# hint: you can either use the 'mode()' method, 
# or the value_counts() along with idxmax()
print('?')
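
The two approaches named in the hint can be tried on a small made-up `Series` first. The values below are hypothetical and are not taken from the cars dataset.

```python
import pandas as pd

# Hypothetical values, invented for this illustration.
years = pd.Series([2010, 2012, 2012, 2013, 2012, 2010])

counts = years.value_counts()   # frequency of each distinct value
print(counts)
print(counts.idxmax())          # the label with the highest count

# mode() returns a Series (there may be ties), so take its first entry.
print(years.mode().iloc[0])
```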

> h. What is the ratio of 2-door to 4-door cars?

In [0]:
# Hint: you can use the output of the value_counts()...
print(df['Doors'].value_counts())
print('\nThe ratio of 2-door to 4-door cars is: {}'.format('?'))

> i. What is the average price of a Honda car versus a Mercedes-Benz car? **Hint:** Try `df.loc`.

In [0]:
# For Honda:
print('Average price of a Honda car = {:.2f}'.format(df.loc[df['Make'] == 'Honda', 'Price'].mean()))
# Now you do for Mercedes-Benz:
print('Average price of a Mercedes-Benz car = {:.2f}'.format(0.00))
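
To see what the row selection inside `df.loc` is doing, here is a minimal sketch on a toy table. The column names mirror the lab's, but the rows are made up. The comparison produces a boolean mask with one entry per row, and `loc` keeps the rows where the mask is `True`.

```python
import pandas as pd

# Hypothetical toy data; values are invented for this illustration.
toy = pd.DataFrame({
    'Make':  ['Honda', 'Ford', 'Honda', 'Ford'],
    'Price': [5000, 7000, 6000, 9000],
})

mask = toy['Make'] == 'Ford'            # boolean Series, one entry per row
print(mask.tolist())
print(toy.loc[mask, 'Price'].mean())    # mean price over the selected rows

# groupby computes the same statistic for every make at once.
print(toy.groupby('Make')['Price'].mean())
```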


---

### 4. Basic data visualisation using Python


---
> **Q4:** Execute the following command and interpret what the plot is showing. In particular, describe the general trend between `Mileage` and `Price`, as well as between `Mileage` and `Year`.

> **A4:**
---

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(df)
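
A numeric complement to reading trends off a pair plot is the correlation matrix. The sketch below uses a made-up toy table (the values are hypothetical) and the pandas `corr` method, which computes the pairwise Pearson correlation between numeric columns; each entry lies between -1 and 1.

```python
import pandas as pd

# Hypothetical toy data, invented purely to show the call.
toy = pd.DataFrame({
    'Year':    [2008, 2010, 2012, 2014],
    'Mileage': [90000, 70000, 40000, 10000],
    'Price':   [3000, 5000, 9000, 15000],
})

# Pairwise Pearson correlation between the numeric columns.
corr = toy.corr()
print(corr.round(2))
```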

## 2.   Explore the Dataset &mdash; Weka



In this part we will explore the same data with Weka data mining software. This type of exploratory analysis is typical when you are faced with a new dataset that you need to understand before trying to solve a specific problem. 

>
#### 0. Start Weka Explorer.
>Note that Weka is installed on your own local machine. Ask for help if you cannot find it!
>
#### 1. Open the dataset: 
>Use the same dataset (`LondonCars.csv`) that you have downloaded from QM+ to your own machine.
>
>`Open → Select csv → LondonCars.csv`. 
>
>You can click `edit` to see the raw data in a spreadsheet style.
>

>---
>a. Again, find out how many instances and attributes the dataset has.

>b. By clicking on an attribute in the `Attributes` panel, you can see its type, possible values (discrete), or statistics (continuous) in the `Selected Attribute` panel.

>c. Which attributes are continuous (numeric) or discrete (nominal/categorical)?

>d. What are the possible body styles? Which is the most and least common body style?

>e. What are the most and least popular external body colors?

>f. What are the minimum, maximum and average values for mileage and price?

>g. Looking at the histogram for those attributes, do they look Gaussian (bell curve) distributed? Are there outliers?

>---
>
#### 2. Use the `Class` pulldown in the `Selected Attribute` panel to color the histogram of one attribute according to another attribute.
>
>
#### 3. Select "Body style" attribute and "Body style" class. Note the color corresponding to each body type.
>>
>---
> a. Select the `Make` attribute, keeping the body style coloring. Which car company only makes "SUV" style cars?

> b. Which car company makes the most "Coupe" style cars?

>---
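
For comparison with the notebook part of the lab, the counts behind a class-colored histogram can also be tabulated in pandas with `pd.crosstab`. This is only a sketch on a toy table; the makes `A` and `B` and their body styles are made up.

```python
import pandas as pd

# Hypothetical toy data; makes and body styles are invented.
toy = pd.DataFrame({
    'Make':       ['A', 'A', 'B', 'B', 'B'],
    'Body Style': ['SUV', 'SUV', 'Coupe', 'Sedan', 'Coupe'],
})

# Rows are makes, columns are body styles, cells are counts.
table = pd.crosstab(toy['Make'], toy['Body Style'])
print(table)
```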

#### 4. The "Doors" attribute has been interpreted as numeric. 

>---
>a. How many unique door counts are there?

>---
#### 5. To use it as a class for coloring, we should convert it to nominal.
>Click `Filter → Unsupervised → Attribute → NumericToNominal`. Click the box next to the filter, set the attribute index to 11 (the index of "Doors") and press `Apply`. Be careful not to apply the filter to all attributes, or they will all become nominal.

> Click the doors attribute and observe the coloring of each door configuration.
>
#### 6. Now select body style attribute and color it by doors class.
>>
>---
> a. Which car types always come with 2 doors? Which always come with 4 doors?

> b. Which car types come in a mix of door configurations?

>c. For each class attribute, you can click `Visualize All` to see each attribute histogram in those colors. See what other patterns you can find.

>---


## 3.   Finding Correlations &mdash; Weka



#### 0. Switch to the <font color='blue'>visualization view</font> in Weka Explorer. 

>This can plot any pair of attributes against each other. Click on any panel to bring up a plot. On these plots every instance (record or row) in the database is shown as a point. If you click on any point it will bring up a window showing you the details of that instance (car).
>
>
#### 1. Select `X`: "Mileage", `Y`: "Price". What do you observe about the relation between the two?
>>
>---
> a. Is there any correlation? Is it linear?

>b. You can simultaneously color the plot by any other variable.

>c. What do you observe when coloring by price?

>d. When coloring by Engine size, what is the impact of 8, 6 or 4 cylinder engines?
>
>---
#### 2. Plot `X`: "External Color", against `Y`: "Price"
>
>---
>a. Which color appears to get the highest valuation overall?

>b. Which color gets the lowest valuation overall? Use the `Jitter` slider to slightly spread out all the points that are on top of each other.
>
>---
#### 3. Looking at the least valuable color:
>
>---
>a. How many cars with that color are shown?

>b. Click on a car (data point) of that category. What make is it?

>c. Is it safe to conclude that cars with that color are generally very likely to be cheap?

>---
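
When comparing groups like this, it helps to look at how many instances each group contains alongside its typical price. The following is a sketch in pandas on a toy table; the colors and prices are made up and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    'Ex Color': ['Black', 'Black', 'Black', 'Gold', 'White', 'White'],
    'Price':    [9000, 11000, 10000, 2000, 8000, 12000],
})

# Per-color group size and typical price, cheapest group first.
summary = toy.groupby('Ex Color')['Price'].agg(['count', 'mean', 'median'])
print(summary.sort_values('mean'))
```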

## 4.   Making Predictions &mdash; Weka



> Let's try to build a car-price predictor. This would be useful for a used car business to know how much to offer to pay for a used car, and how much to sell it for.
>
>
#### 0. Switch back to the <font color='blue'>Preprocess view</font>. 

>Remove all attributes besides the numeric ones: "Year", "Mileage", "Price". (Click the nominal attributes and then press Remove). In later exercises we will get back to predictions using nominal inputs.
>
>
#### 1. Switch to <font color='blue'>classify view</font>.
>
>Under `Classifier`, choose `Classifiers → Functions → Linear Regression`.

>Make sure "Price" is selected as the target variable to predict in the drop-down box, and press start.
>
>
#### 2. Observe the results in the output panel:
>
>The regression model discovers the values of `A`, `B`, and `C` in an equation of the form $Price = A\times Year+B \times Mileage+C$. It can then use this equation to predict the price of a new car.

>---
>a. How much value does each additional year (of the `Year` attribute) add to the price?

>b. How much value is lost for every mile driven?

>c. The mean absolute error is the average absolute difference between the predicted price of each car and the true price. What is it in this case?

>d. Do you think this is an acceptable level of prediction accuracy for a used car business? What could we do to improve the accuracy?

>---
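
To connect the equation above with code, here is a minimal sketch that fits $Price = A\times Year+B \times Mileage+C$ by ordinary least squares using NumPy and then computes the mean absolute error. It is not Weka's implementation, which may differ in its details, and the data is a made-up toy sample generated from arbitrary coefficients plus small errors.

```python
import numpy as np

# Hypothetical toy data: arbitrary coefficients plus small made-up errors.
year = np.array([2008, 2010, 2012, 2014, 2011], dtype=float)
mileage = np.array([90000, 60000, 45000, 10000, 80000], dtype=float)
noise = np.array([200, -300, 100, -100, 100], dtype=float)
price = 500 * year - 0.05 * mileage - 990000 + noise

# Design matrix with columns Year, Mileage, and a constant 1 for C.
X = np.column_stack([year, mileage, np.ones_like(year)])

# Least-squares solution for the coefficients [A, B, C].
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
A, B, C = coef

predicted = X @ coef
mae = np.mean(np.abs(predicted - price))   # mean absolute error
print('A = {:.2f}, B = {:.4f}, C = {:.2f}'.format(A, B, C))
print('MAE = {:.2f}'.format(mae))
```

The fitted `A`, `B`, and `C` play the same roles as in the equation above: `A` is the price change per year, `B` the price change per mile, and `C` the constant offset.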
