# Data Analysis

In this notebook, we will learn to use the Pandas library. Pandas is an open-source Python library for working with data. Because it is open source, anyone can read the source code used to build it.

In [None]:
# Use the 'import' statement to load a library into a Jupyter notebook
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Read a data set into your jupyter notebook
data = pd.read_csv('college-admissions.csv')

In [None]:
# Here is how you access documentation in a jupyter notebook
pd.read_csv?

In [None]:
# See the first n rows of data (n defaults to 5)
data.head()

In [None]:
data.head(10)

In [None]:
# See the last n rows of data (n defaults to 5)
data.tail()

----------------------------------------------------------------------------------------

**Practice.** How would you see the last 10 rows of data?

In [None]:
# Type Answer Here

---------------------------------------------------------------

In [None]:
# See the dimensions of the data
data.shape

In [None]:
# See detailed information about your data
data.info()

_______________________________

## Data Classifications

Data may be classified as qualitative or quantitative.

Qualitative data is categorical:
- Gender
- Religion
- Sports
- Grade level

Quantitative data is numerical:
- Age
- Height
- Weight
- Income

Understanding how data is classified is important because it determines which operations you can meaningfully perform on the data.
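
To make the distinction concrete, here is a minimal sketch that uses a tiny made-up table rather than the admissions file. The `students` DataFrame and its `sport` and `age` columns are assumptions chosen only for illustration. It shows one summary that suits a quantitative column and two that suit a qualitative column.

```python
import pandas as pd

# Made-up records for illustration only; not taken from the admissions file
students = pd.DataFrame({
    "sport": ["soccer", "tennis", "soccer", "golf"],  # qualitative
    "age": [19, 21, 20, 22],                          # quantitative
})

# Quantitative column: arithmetic summaries such as the mean are meaningful
mean_age = students["age"].mean()

# Qualitative column: count how often each category occurs instead
sport_counts = students["sport"].value_counts()
most_common_sport = students["sport"].mode()[0]

print(mean_age)
print(sport_counts)
print(most_common_sport)
```

Asking for the mean of the `sport` column would not give a useful answer, which is why the classification matters before you pick an operation.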

______________________________

## College Admissions Data Set

As you saw, the college admissions data consists of 400 rows and 7 columns. Each row may be viewed as a single record that represents one student. Therefore, there are 400 students in our data set.

The columns may also be called "features"; taken together, these features describe the students in our data set. Your goal is to use these features to gather insights about the students that the data represents.

**Column Names**
- admit: takes the values (0,1) to indicate whether students were admitted to university or not.
- gre: graduate record exam scores
- gpa: grade point average
- ses: socioeconomic status where 1=low, 2=medium, 3=high
- gender_male: takes the values (0,1) to indicate whether students are male or not.
- race: 1=hispanic, 2=asian, 3=african-american
- rank: the prestige of the undergraduate university, where 1 has the highest prestige and 4 has the lowest.
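
Several of these columns, such as ses and race, are stored as numeric codes even though they describe categories. One possible way to make them easier to read is to map each code to its label. The sketch below assumes a small made-up table named `sample`; the lookup dictionaries and the new `ses_label` and `race_label` columns are hypothetical names, while the code-to-label pairs come from the list above.

```python
import pandas as pd

# Made-up rows for illustration only; the real values come from the CSV file
sample = pd.DataFrame({
    "admit": [1, 0, 0, 1],
    "ses": [1, 3, 2, 3],
    "race": [2, 1, 3, 2],
})

# Code-to-label lookups taken from the column descriptions
ses_labels = {1: "low", 2: "medium", 3: "high"}
race_labels = {1: "hispanic", 2: "asian", 3: "african-american"}

# Series.map replaces each code with its matching label
sample["ses_label"] = sample["ses"].map(ses_labels)
sample["race_label"] = sample["race"].map(race_labels)

print(sample)
```

Keeping the original coded columns alongside the labeled ones means the numeric versions remain available if you need them later.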

___________________________________________________________________

In [None]:
# See summary statistics on the data columns
data.describe()

In [None]:
# Find the median
data['gre'].median()

In [None]:
# Find the standard deviation
data['gre'].std()

In [None]:
# Find the mode
data['admit'].mode()

In [None]:
# Find the unique values
data['race'].unique()

___________________________________________________________

**Practice.** Find the unique values for the 'socioeconomic status' variable.

In [None]:
# Type Answer Here

**Practice.** Find the unique values for the 'gender' variable.

In [None]:
# Type Answer Here

______________________________________________

In [None]:
# Create a frequency table
data['race'].value_counts()

In [None]:
# Plot the frequency table as a bar chart
data['race'].value_counts().plot(kind='bar');

______________________________________

**Practice.** Create a frequency table for the 'admit' variable.

In [None]:
# Type Answer Here

**Practice.** Plot the frequency table for the 'admit' variable as a bar chart. What does the chart tell you?

In [None]:
# Type Answer Here

_______________________________

In [None]:
# Create a histogram to show the distribution of gre scores
data['gre'].hist()

In [None]:
# What happens if I try to use a frequency table on quantitative data?
data['gre'].value_counts()

In [None]:
data['gre'].value_counts().plot(kind='bar')

In [None]:
# Watch this YouTube video that describes why we use histograms for quantitative data
from IPython.display import YouTubeVideo
YouTubeVideo(id='qBigTkBLU6g', width=600, height=300)

_________________________________________________

In [None]:
# Use the groupby method to find the mean GPA for each race category
data.groupby('race')['gpa'].mean()

____________________________________________

**Practice.** What is the average GRE score by socioeconomic status for students in the data set?

In [None]:
# Type Answer Here

_____________________________________________

## Lab Assignment

Working in groups of 2-3 students, review the college admissions data set. Based on the features provided, write down 20 questions that you could ask and answer using this data set.