## Data Analysis

* In this section, you will learn how to approach data acquisition in various ways and obtain necessary insights from a dataset.

* By the end of this lab, you will successfully load the data into Jupyter Notebook and gain some fundamental insights via the Pandas Library.

* In our case, the Diabetes Dataset comes from an online source and is stored in CSV (comma-separated values) format. Let's use this dataset as an example to practice data reading.

#### Scenario:

* **Context**: This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. 
    * The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.  
    * Several constraints were placed on the selection of these instances from a larger database.  
    * In particular, all patients here are females at least 21 years of age of Pima Indian heritage.

* **Content**: The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
* We have 768 rows and 9 columns. The first 8 columns represent the features and the last column represents the target/label.


In [None]:
# Import the pandas library
import pandas as pd
# For local execution, use the requests library for a synchronous download
import requests

# URL of the Diabetes Dataset described in the scenario above
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/diabetes.csv"

def download(url, filename):
    # Fetch the URL and write the response body to a local file
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, "wb") as f:
            f.write(response.content)
    else:
        print(f"Failed to download {url}. Status code: {response.status_code}")

# Save a local copy, then load it into a DataFrame
download(filename, "diabetes.csv")
df = pd.read_csv("diabetes.csv")
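The reading step can also be tried without a network connection, since `pd.read_csv` accepts a file-like object as well as a path. The sketch below assumes a tiny, made-up CSV string; the name `csv_text` and its values are illustrative only and are not rows from the Diabetes Dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text for illustration only (not real rows from the dataset)
csv_text = "Pregnancies,Glucose,Outcome\n1,100,0\n2,150,1\n"

# read_csv parses the header line into column names and each later line into a row
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

`pd.read_csv` also accepts a URL string, so passing the dataset URL directly may work as an alternative to downloading the file first, depending on the environment.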

* After reading the dataset, we can use the 'dataframe.head(n)' method to check the top n rows of the dataframe, where n is an integer. In contrast, 'dataframe.tail(n)' shows the bottom n rows of the dataframe.
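As a minimal sketch of how the two methods behave, the example below uses a hypothetical ten-row DataFrame (`toy_df` is an illustrative name, not part of the lab data) so the selected rows are easy to verify by eye.

```python
import pandas as pd

# Hypothetical toy data: ten rows in a single column
toy_df = pd.DataFrame({"value": range(10)})

first_three = toy_df.head(3)  # rows with index 0, 1, 2
last_three = toy_df.tail(3)   # rows with index 7, 8, 9

print(first_three)
print(last_three)
print(toy_df.shape)  # (number of rows, number of columns)
```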

In [None]:
# Show the first 5 rows using the dataframe.head() method
print("The first 5 rows of the dataframe")
# Only the last expression in a cell is displayed automatically, so print explicitly
print(df.head(5))

# To view the dimensions of the dataframe, we use the .shape attribute.
print(df.shape)

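* **Statistical Overview of dataset**

* The 'dataframe.info()' method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage. The 'dataframe.describe()' method returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns.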

In [None]:
df.info()
df.describe()
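To make the output of the summary step concrete, here is a small sketch on a hypothetical four-row DataFrame (`stats_df` and its values are made up for illustration). It shows that the result of `describe()` is itself a DataFrame whose rows can be looked up by name.

```python
import pandas as pd

# Hypothetical values for illustration only
stats_df = pd.DataFrame({"Age": [21, 30, 45, 60], "BMI": [22.5, 31.0, 28.4, 35.1]})

# info() prints column names, non-null counts, dtypes and memory usage
stats_df.info()

# describe() returns a DataFrame indexed by statistic name
summary = stats_df.describe()
print(summary)

# Individual statistics can be read with .loc[row, column]
mean_age = summary.loc["mean", "Age"]
print(mean_age)
```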

* **Identify and handle missing values**

We use pandas methods to identify missing values. There are two methods to detect missing data:

**.isnull()**

**.notnull()**

The output is a DataFrame of boolean values with the same shape as the original, where each entry indicates whether the corresponding value is missing (for `.isnull()`) or present (for `.notnull()`).
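The relationship between the two methods can be sketched on a small DataFrame with deliberately inserted gaps. The data below is hypothetical (`na_df` is an illustrative name), with `np.nan` standing in for missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries, for illustration only
na_df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, np.nan],
})

is_missing = na_df.isnull()   # True where a value is missing
is_present = na_df.notnull()  # True where a value is present

print(is_missing)
print(is_present)
```

Each mask is the element-wise negation of the other, so either one is enough to locate the gaps.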

In [None]:
missing_data = df.isnull()
missing_data.head(5) # "True" stands for missing value, while "False" stands for not missing value.

* **Count missing values in each column**

* Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset.

* In the body of the for loop, the method ".value_counts()" counts how many times each value ("True" or "False") occurs in the column.

In [None]:
# Iterating over .columns yields each column name in turn
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")
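As an alternative to the loop, the boolean mask can be summed column by column, because `True` counts as 1 and `False` as 0. The sketch below uses hypothetical data (`count_df` is an illustrative name) rather than the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with known gaps, for illustration only
count_df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, np.nan],
    "BMI": [33.6, 26.6, np.nan, 23.3],
})

# Summing the boolean mask gives the number of missing values per column
missing_per_column = count_df.isnull().sum()
print(missing_per_column)
```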

<h3 id="*correct_data_format">Correct data format</h3>

<p>Check that all data is in the correct format (int, float, text or other).</p>

In Pandas, we use

<p><b>.dtypes</b> to check the data type of each column</p>
<p><b>.astype()</b> to change the data type</p>

Numerical variables should have type **'float'** or **'int'**.



In [None]:
df.dtypes
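A conversion with `.astype()` might look like the sketch below. It assumes a hypothetical DataFrame (`type_df`) in which a numeric column was read in as text; the lab dataset itself may not need any conversion.

```python
import pandas as pd

# Hypothetical data: "Age" is stored as text, "BMI" as floating-point numbers
type_df = pd.DataFrame({"Age": ["21", "30", "45"], "BMI": [22.5, 31.0, 28.4]})
print(type_df.dtypes)

# Convert the text column to integers and assign the result back
type_df["Age"] = type_df["Age"].astype("int64")
print(type_df.dtypes)
```

`.astype()` returns a new object rather than modifying the column in place, which is why the result is assigned back.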

* **Visualization**
* **Visualization** is one of the best ways to get insights from the dataset. **Seaborn** and **Matplotlib** are two of Python's most powerful visualization libraries.

In [None]:
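# import libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Sort counts by class value so they line up with the labels (0 = Not Diabetic, 1 = Diabetic)
labels = 'Not Diabetic', 'Diabetic'
plt.pie(df['Outcome'].value_counts().sort_index(), labels=labels, autopct='%0.02f%%')
plt.legend()
plt.show()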
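The percentages shown on the pie chart can also be computed directly with pandas, which is a quick way to check the plot. The sketch below uses a hypothetical four-value `Outcome` series, not the real class counts.

```python
import pandas as pd

# Hypothetical outcomes for illustration: three class-0 rows and one class-1 row
outcome = pd.Series([0, 0, 0, 1], name="Outcome")

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
shares = outcome.value_counts(normalize=True).sort_index() * 100
print(shares.round(2))
```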