# Understanding the problem and importing data

The first part of data analysis is to understand the problem, ask relevant questions, import (or export) the data, and finally perform some basic analysis on the data before starting the core data analysis process.

## The problem 

The first step is to understand the problem that we want to solve and to ask relevant questions.

For example, suppose we have a car to sell and need to determine the best selling price for it, i.e., a price that is neither too low nor too high. Questions we can ask include: Is there an existing dataset of used car prices that we can use for our analysis? Which features of the car affect the price, such as color, brand, or horsepower? To solve this problem we will use a [used car prices dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/autos/) located at `data/imports-85-data.csv`; the dataset description file is located at `data/imports-85-names.csv`.

## Understanding the data

Next, we need to look at the dataset we want to use and get a basic understanding of it: what it looks like, which columns it has, and so on. We should also read the documentation that comes with the data.

The dataset being used is an open, publicly available dataset in csv format. The dataset itself does not contain headers. We can look at the documentation file `imports-85-names.md` to get information about the headers, such as their names and the range of values. For example, `symboling` represents the insurance risk value (+3 is most risky and -3 is least risky). The higher the risk level, the higher the insurance cost. `normalized-losses` is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.) and represents the average loss per car per year. `price` is our target value or label, i.e., the value that we want to predict. The other variables will be used as features for predicting this value.

## Importing and Exporting data in Python

The `pandas` package in Python is versatile: it can read (import) data from, and write (export) data to, many kinds of data sources.

### Importing data 

This is the process of loading and reading data into Python from various resources. To read any data using `pandas`, there are two important factors to consider:
1. Format - This is the way in which the data is encoded. It can generally be determined from the extension of the file, like csv, json, xlsx, hdf, etc.
2. File path - This is where the dataset is located. It can be a location on the computer, an internet URL, a server location, etc.
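
The two factors map directly onto the pandas call: the format selects which reader function to use, and the path is its first argument. The following is a minimal sketch that uses in-memory text in place of real files, so the sample rows and variable names are made up for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample content standing in for real files.
csv_text = "make,price\naudi,13950\nbmw,16430\n"
json_text = '[{"make": "audi", "price": 13950}, {"make": "bmw", "price": 16430}]'

# The format decides which reader to call; the first argument can be
# a file path, a URL, or (as here) a file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv.shape)
print(from_json.shape)
```

Both readers return a `DataFrame`, so the rest of the analysis does not depend on the original file format.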

For our dataset, the data is available at [this location on the internet](http://archive.ics.uci.edu/ml/machine-learning-databases/autos/). (If the data is not available there for some reason, the same data is available at `data/imports-85-data.csv`.) Looking at the data, we see that the values are separated by `,`, which means that the data is most probably in `csv` format. We can read this data in pandas using the `read_csv` method. The dataset does not have a header row, so we tell `read_csv` this by passing `header=None`.

In [1]:
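import pandas as pd

# Location of the raw data file (comma-separated, no header row)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

# header=None tells pandas that the first line is data, not column names
df = pd.read_csv(url, header=None)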
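
To see why `header=None` matters, here is a small offline sketch. It assumes two shortened, made-up rows in the same shape as the dataset (comma-separated, no header row) and compares the default behaviour with `header=None`.

```python
import io
import pandas as pd

# Two shortened, made-up rows without a header line.
raw = "3,?,alfa-romero,gas\n2,164,audi,gas\n"

# Default: the first line is consumed as column names.
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every line is data and columns get integer labels.
without_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)            # only one data row is left
print(without_header.shape)          # both rows are kept
print(list(without_header.columns))  # integer labels starting at 0
```

Without `header=None`, the first record of the dataset would silently be lost as a header row.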

#### Printing the dataframe in Python

Once we have imported the data, the next step is to look at it to get a feel for its contents. Since printing the complete dataset can take a lot of time and space for big datasets, we can use `df.head(n)` to print the first `n` rows of the dataframe or `df.tail(n)` to print the last `n` rows. If `n` is omitted, 5 rows are shown.

In [2]:
df.head()

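| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |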
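
The behaviour of `head` and `tail` can also be checked on a tiny stand-in dataframe. This is only an illustrative sketch; the `sample` dataframe and its `row` column are hypothetical and not part of the car dataset.

```python
import pandas as pd

# A small stand-in DataFrame with ten rows.
sample = pd.DataFrame({"row": range(10)})

first_three = sample.head(3)   # rows 0, 1, 2
last_two = sample.tail(2)      # rows 8, 9
default_head = sample.head()   # n defaults to 5

print(first_three)
print(last_two)
print(len(default_head))
```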


#### Adding headers

Since we have set `header=None` in `read_csv`, pandas has assigned integers as the column names. We can define our own headers by assigning a list of names to the `df.columns` attribute, as follows:

In [4]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style","drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]

df.columns = headers

df.head()

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |


We can see that the dataframe now uses the header names we assigned instead of integers.
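
As an alternative to assigning `df.columns` after loading, `read_csv` also accepts a `names` parameter that sets the column names while reading. The sketch below uses a shortened, made-up sample with only three of the columns; `raw` and `short_headers` are illustrative names.

```python
import io
import pandas as pd

# Shortened, made-up sample without a header row.
raw = "3,?,alfa-romero\n2,164,audi\n"
short_headers = ["symboling", "normalized-losses", "make"]

# names= supplies the column names; since no header row is declared,
# the first line is treated as data.
named = pd.read_csv(io.StringIO(raw), names=short_headers)

print(list(named.columns))
print(named.shape)
```

Either approach gives the same column names; which one to use is a matter of preference.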

### Exporting data

As with importing, we can export our data using pandas. We might want to do this to save the data after performing some operations on it.

To save data in csv format, we can use the `to_csv(path)` method, where `path` is the location where we want to save our data.

In [5]:
df.to_csv("data/automobile.csv")
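
One detail worth knowing is that `to_csv` writes the row index as an extra, unnamed first column by default. Passing `index=False` leaves it out. The sketch below writes to in-memory buffers rather than files, and the `sample` dataframe is a made-up stand-in.

```python
import io
import pandas as pd

sample = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)

# index=False leaves the row labels out of the output.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,make,price
print(without_index.getvalue().splitlines()[0])  # make,price

# Reading the default output back shows the extra column.
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(list(reloaded.columns))
```

When the default output is read back, the index appears as a column named `Unnamed: 0`.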

## Starting to analyze data

We will use some pandas methods to explore the data once the dataset is loaded (imported). Pandas has multiple built-in functions that we can use for things such as
* Identifying the different data types
* Seeing how the data is distributed
* Locating potential issues in the data
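
A rough sketch of these three checks, on a small made-up dataframe that mimics a few columns of the car dataset (the values are illustrative only):

```python
import pandas as pd

sample = pd.DataFrame({
    "make": ["alfa-romero", "audi", "audi", "bmw"],
    "normalized-losses": ["?", "164", "164", "192"],
    "city-mpg": [21, 24, 18, 23],
})

# 1. Data type of each column.
print(sample.dtypes)

# 2. Summary statistics; by default only numeric columns are included.
summary = sample.describe()
print(summary)

# 3. One possible issue: "?" used as a placeholder for missing values.
placeholder_count = sample["normalized-losses"].eq("?").sum()
print(placeholder_count)
```

Note that `normalized-losses` does not appear in the `describe()` output because it was not stored as a numeric column.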

### Data type insight

It is important to check the data types of the imported dataset for the following reasons:
* Pandas infers the data type of each column automatically on import, so a column may be assigned an incorrect type. For example, a numeric column that contains a non-numeric placeholder such as `?` is read as `object` rather than as a number.
* It helps in identifying which columns are compatible with which Python functions. For example, mathematical operations can be applied to numeric data only.
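
Both points can be illustrated with a short sketch. It assumes a made-up column in which `?` marks a missing value, and uses `pd.to_numeric` with `errors="coerce"` as one possible way to obtain a numeric column; this is an illustration, not necessarily the cleaning approach used later in the analysis.

```python
import pandas as pd

# Hypothetical column in which "?" marks a missing value.
losses = pd.Series(["?", "164", "164", "192"], name="normalized-losses")
print(losses.dtype)  # a text type, so numeric operations are not meaningful

# errors="coerce" turns values that cannot be parsed into NaN.
numeric_losses = pd.to_numeric(losses, errors="coerce")
print(numeric_losses.dtype)

# Numeric operations now work; NaN is skipped by default.
print(numeric_losses.mean())
```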