# Exploratory Data Analysis: Examining Relationships

### Q -> Q

_Using Python_

In [2]:
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

In [3]:
data = pd.read_excel('files/height.xls')
data.head()

Unnamed: 0,gender,height,weight
0,0,72,155
1,0,67,145
2,0,65,125
3,1,67,120
4,1,63,105


In [4]:
data = data.sort_values(by='gender')
data.head()

Unnamed: 0,gender,height,weight
0,0,72,155
58,0,72,230
57,0,72,163
56,0,72,198
55,0,72,170


In [5]:
ht_vs_wt = [go.Scatter(x=data.height, y=data.weight, mode='markers')]
layout = go.Layout(
    title='Height vs Weight',
    xaxis={'title': 'Height(inches)'},
    yaxis={'title': 'Weight(lbs)'},
)

figure = go.Figure(data=ht_vs_wt, layout=layout)
iplot(figure)

##### Labelled Scatter plot

In [6]:
# Gender coding is assumed from the data preview above: 0 = male, 1 = female
ht_vs_wt_male = go.Scatter(x=data.height[data.gender==0], y=data.weight[data.gender==0], mode='markers', name="Male")
ht_vs_wt_female = go.Scatter(x=data.height[data.gender==1], y=data.weight[data.gender==1], mode='markers', name="Female")

In [7]:
figure = go.Figure(data=[ht_vs_wt_female, ht_vs_wt_male], layout=layout)

In [8]:
iplot(figure)

### Q -> Q: Linear Relationships

In [9]:
animals = pd.read_excel('files/animals.xls')
animals.head()

Unnamed: 0,animal,gestation,longevity
0,baboon,187,20
1,"bear, black",219,18
2,"bear, grizzly",225,25
3,"bear, polar",240,20
4,beaver,122,5


Correlation only measures the strength of a linear relationship, so we first plot the data to check what type of relationship exists between gestation period and longevity.

In [10]:
data = [go.Scatter(x=animals.gestation, y=animals.longevity, text=animals.animal, mode='markers')]
layout = go.Layout(
    xaxis={'title': 'Average Gestation Period of Species (days)'},
    yaxis={'title': 'Average Longevity of Species (years)'}
)
iplot(go.Figure(data=data, layout=layout))

In [11]:
r = animals.longevity.corr(animals.gestation)
print(f"Correlation Coefficient = {r}")

Correlation Coefficient = 0.6632396748585047
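
The coefficient that `Series.corr` reports by default is Pearson's r. As a rough illustration of what is being computed, here is a minimal pure-Python sketch of the formula. The function name `pearson_r` and the toy numbers are assumptions made for illustration; they are not taken from `animals.xls`.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations (numerator) and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up values, for illustration only.
gestation_days = [100, 150, 200, 250, 300]
longevity_years = [5, 12, 10, 20, 18]
r_toy = pearson_r(gestation_days, longevity_years)
print(f"r = {r_toy:.3f}")
```

Because r is built from deviations around the means, it is unit-free and always lies between -1 and 1.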


The outlier (elephant) may be affecting the correlation. Let us check how by recomputing r without it.

In [12]:
r = animals.longevity[animals.animal != 'elephant'].corr(animals.gestation[animals.animal != 'elephant'])
print(f"Correlation Coefficient without elephant = {r}")

Correlation Coefficient without elephant = 0.5190389111466761
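
The same check can be generalised: drop each observation in turn and see how far r moves. The sketch below does this in pure Python on made-up points (the labels, values, and helper names are hypothetical, not the animals data), where one extreme point plays the role of the elephant.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def leave_one_out_r(labels, xs, ys):
    """Return {label: r computed with that single observation removed}."""
    result = {}
    for i, label in enumerate(labels):
        result[label] = pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    return result

# Made-up data: four ordinary points plus one extreme point ("E").
labels = ["A", "B", "C", "D", "E"]
xs = [1, 2, 3, 4, 20]
ys = [2, 1, 4, 3, 15]

r_all = pearson_r(xs, ys)
loo = leave_one_out_r(labels, xs, ys)
# The observation whose removal changes r the most is the most influential one.
most_influential = max(loo, key=lambda k: abs(loo[k] - r_all))
print(f"r with all points = {r_all:.3f}")
print(f"most influential point: {most_influential}, r without it = {loo[most_influential]:.3f}")
```

In this toy example the extreme point inflates r; in other data sets an outlier can just as easily deflate it, which is why the comparison is worth running rather than guessing.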


### Q -> Q: Linear Regression

In [13]:
from sklearn import linear_model
from scipy.stats import linregress

In [14]:
olympics = pd.read_excel('files/olympics_2012.xls')
olympics.head()

Unnamed: 0,Year,Time
0,1896,273.2
1,1900,246.0
2,1904,245.4
3,1908,243.4
4,1912,236.8


In [15]:
year_vs_time = go.Scatter(x=olympics.Year, y=olympics.Time, name='Data', mode='markers')
layout = go.Layout(xaxis={'title': 'Year of Olympic Games'},
                   yaxis={'title': 'Winning Time of 1500m Race (secs)'})

iplot(go.Figure(data=[year_vs_time], layout=layout))

In [16]:
least_sq_reg = linregress(olympics.Year, olympics.Time)

In [17]:
line = olympics.Year * least_sq_reg.slope + least_sq_reg.intercept

In [18]:
reg_line = go.Scatter(x=olympics.Year, y=line, name="Regression Line")

In [19]:
iplot(go.Figure(data=[year_vs_time, reg_line], layout=layout))

In [20]:
print(f"Equation of regression line is Y = {least_sq_reg.slope:.2f} * X + {least_sq_reg.intercept:.2f}")

Equation of regression line is Y = -0.35 * X + 916.43
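
For intuition about what `linregress` returns, the least-squares slope and intercept can be written out directly. The sketch below is a pure-Python version under the usual simple-regression formulas; `least_squares_line` is a hypothetical helper name. It is applied only to the five rows visible in `olympics.head()`, so its coefficients differ from the full-data fit printed above.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the line minimising the sum of squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Only the five rows shown in the preview, not the whole file.
years = [1896, 1900, 1904, 1908, 1912]
times = [273.2, 246.0, 245.4, 243.4, 236.8]
slope5, intercept5 = least_squares_line(years, times)
print(f"Y = {slope5:.3f} * X + {intercept5:.2f}")
```

The much steeper slope on these five early games already hints at how strongly the 1896 result pulls the line.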


We see that there is an outlier, the year 1896. Let's check the effect of this outlier on the relationship.

In [21]:
olympics_from_1900 = olympics[olympics.Year >= 1900]
olympics_from_1900.head()

Unnamed: 0,Year,Time
1,1900,246.0
2,1904,245.4
3,1908,243.4
4,1912,236.8
5,1920,241.8


In [22]:
trace_xy = go.Scatter(x=olympics_from_1900.Year, y=olympics_from_1900.Time, mode='markers', name="Data")
iplot(go.Figure(data=[trace_xy], layout=layout))

In [23]:
lm = linear_model.LinearRegression()
lm.fit(olympics_from_1900.Year.values.reshape(-1, 1), olympics_from_1900.Time)
slope, intercept = lm.coef_[0], lm.intercept_


In [24]:
print(f"Equation of regression line is Y = {slope:.2f} * X + {intercept:.2f}")

Equation of regression line is Y = -0.30 * X + 811.54


In [25]:
line = slope*olympics_from_1900.Year + intercept
reg_line = go.Scatter(x=olympics_from_1900.Year, y=line, name="Regression Line")
iplot(go.Figure(data=[trace_xy, reg_line], layout=layout))

In [26]:
print(f"Prediction for 1,500 meter time in the 2016 Olympic Games in Rio de Janeiro is {slope*2016 + intercept:.2f}secs")

Prediction for 1,500 meter time in the 2016 Olympic Games in Rio de Janeiro is 207.25secs
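
A prediction for 2016 lies outside the years used to fit the line, so it is an extrapolation and should be read with caution. The sketch below shows one way to flag that. The helper `predict_with_flag` is hypothetical, the range 1900 to 2012 is an assumption inferred from the filter above and the file name, and the rounded coefficients from the printed equation are used, so the result differs slightly from the 207.25 computed with full precision.

```python
def predict_with_flag(slope, intercept, x, x_min, x_max):
    """Return (prediction, is_extrapolation) for a fitted straight line."""
    y_hat = slope * x + intercept
    # A value outside the observed x-range means the line is being extrapolated.
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation

# Rounded coefficients from the printed equation; year range is an assumption.
y_hat_2016, extrapolating = predict_with_flag(-0.30, 811.54, 2016, x_min=1900, x_max=2012)
print(f"Predicted time: {y_hat_2016:.2f} secs (extrapolation: {extrapolating})")
```

The gap between this value and the one above also shows why predictions should be computed from the unrounded coefficients: with x in the thousands, a small rounding of the slope shifts the result noticeably.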


## StatTutor

A student survey was conducted at a major university. Data were collected from a random sample of 239 undergraduate students, and the information that was collected included physical characteristics (such as height, handedness, etc.), study habits, academic performance and attitudes, and social behaviors. In this exercise, we will focus on exploring relationships between some of those variables. Note that the symbol * in the worksheet means that this observation is not available (this is known as a 'missing value').

**Gender:** Male or Female  
**Height:** Self-reported height (in inches)  
**GPA:** Student's cumulative college GPA  
**HS_GPA:** Student's high school GPA (senior year)  
**Seat:** Typical classroom seat location (F = Front, M = Middle, B = Back)  
**WtFeel:** Does the student feel that he/she is: Underweight, About Right, Overweight  
**Cheat:** Would the student tell the instructor if he/she saw somebody cheating on an exam? (No or Yes)  

### 1. Understanding the Problem: Check Data Format

In [30]:
students = pd.read_excel('files/body_image.xls')
students = students.replace({'*': None})
students.head(10)

Unnamed: 0,Gender,Height,GPA,HS GPA,Seat,WtFeel,Cheat
0,Female,64.0,2.6,2.63,M,AboutRt,No
1,Male,69.0,2.7,3.72,M,AboutRt,No
2,Female,66.0,3.0,3.44,F,AboutRt,No
3,Female,63.0,3.11,2.73,F,AboutRt,No
4,Male,72.0,3.4,2.35,B,OverWt,No
5,Female,67.0,3.43,3.84,M,AboutRt,No
6,Male,69.0,3.7,4.0,F,,No
7,Male,74.0,3.7,3.92,B,AboutRt,No
8,Male,72.0,3.77,3.09,M,UnderWt,No
9,Female,63.0,3.5,4.0,F,AboutRt,No


### Q1. Is there a relationship between students' college GPAs and their high school GPAs?

#### 1. Reflect on Question
In this first step, we think about the question and use our intuition and/or experience to try and predict what the results will show. Later, we will compare what we initially thought to what we actually find when we analyze the data.

>Yes. Students who perform well in high school generally continue to perform well in college, although their GPA may dip somewhat, perhaps because of the distractions of college life.

#### 2. Analyze Data

##### a. Plan Analyses

Before choosing the appropriate analyses, it is helpful to identify the relevant variables, which for this question are:
- GPA
- HS_GPA

The variable **GPA** is the *dependent (response)* variable and is *quantitative*.  
The variable **HS GPA** is the *independent (explanatory)* variable and is *quantitative*.

##### b. Exploratory Analysis 
- A meaningful display for this question is a *Scatterplot*.
- A meaningful numerical summary to supplement the above display is the *Correlation* (if the relationship is linear).
- Using this display and numerical summary, I will describe the relationship between two quantitative variables.

*Scatterplot*

In [32]:
trace_1 = go.Scatter(x=students['HS GPA'], y=students['GPA'], mode='markers', name='Data')
layout = go.Layout(
    title='College GPA vs High School GPA',
    xaxis={'title': 'High School GPA'},
    yaxis={'title': 'College GPA'}
)

In [33]:
iplot(go.Figure(data=[trace_1], layout=layout))

In [36]:
r_gpa = students['HS GPA'].corr(students['GPA'])
print("Correlation between High School GPA and College GPA is", r_gpa)

Correlation between High School GPA and College GPA is 0.715547363422


In [45]:
mask = ~students['HS GPA'].isna() & ~students['GPA'].isna()

In [47]:
lm = linregress(x=students['HS GPA'][mask], y=students['GPA'][mask])
slope, intercept = lm.slope, lm.intercept
gpa_pred = slope * students['HS GPA'] + intercept
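
The mask above keeps only the rows where both GPAs are present, so that missing values do not propagate into the fit. The same idea, stripped of pandas, is sketched below. The function name `complete_pairs` and the values are made up for illustration, with `None` and NaN standing in for the `*` entries.

```python
import math

def complete_pairs(xs, ys):
    """Keep only the (x, y) pairs in which neither value is missing."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [(x, y) for x, y in zip(xs, ys) if not (is_missing(x) or is_missing(y))]

# Made-up values, for illustration only.
hs_gpa = [3.5, None, 2.8, 4.0, float("nan")]
gpa = [3.2, 3.0, None, 3.9, 2.5]
pairs = complete_pairs(hs_gpa, gpa)
print(f"{len(pairs)} of {len(gpa)} rows are usable: {pairs}")
```

Dropping incomplete pairs changes the sample size, so it is worth reporting how many rows were actually used alongside the fitted equation.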

In [49]:
trace_2 = go.Scatter(x=students['HS GPA'], y=gpa_pred, name="Regression Line")

iplot(go.Figure(data=[trace_1, trace_2], layout=layout))

In [51]:
print(f"What is the regression equation? College GPA = {slope:.2f} * High School GPA + {intercept:.2f}")
print(f"Does the line fit the pattern of the data well? Moderately well")
print(f"What is the predicted college GPA of a high school senior whose GPA is 3.45? {slope * 3.45 + intercept:.2f}")

What is the regression equation? College GPA = 0.62 * High School GPA + 1.07
Does the line fit the pattern of the data well? Moderately well
What is the predicted college GPA of a high school senior whose GPA is 3.45? 3.21


**Reporting Results:**  
The scatterplot displays a positive linear relationship between HS GPA and college GPA. The correlation coefficient r is 0.716, indicating that the positive linear relationship is moderately strong.

#### 3. Conclusion

- The results are consistent with what can be expected.
- We should intervene and counsel students while they are still in high school. Knowing that students with low high school GPAs are at risk of not doing well in college, colleges should develop programs, such as peer or faculty mentors, for these students.

### Q2. Are there differences between males and females with respect to body image?

#### 1. Reflect on Question
In this first step, we think about the question and use our intuition and/or experience to try and predict what the results will show. Later, we will compare what we initially thought to what we actually find when we analyze the data.

>Yes. Based on my personal experience, I think females are more concerned about their body image than males.  
> There is no right or wrong answer here. In the past, body image was a problem associated mostly with females. These days, however, this is no longer the case.

#### 2. Analyze Data

##### a. Plan Analyses

Before choosing the appropriate analyses, it is helpful to identify the relevant variables, which for this question are:
- Gender
- WtFeel

The variable **Gender** is the *independent (explanatory)* variable and is *categorical*.  
The variable **WtFeel** is the *dependent (response)* variable and is *categorical*.

##### b. Exploratory Analysis 
- A meaningful display for this question is a *Two-way Table*.
- A meaningful numerical summary to supplement the above display is *conditional percentages*.
- Using this display and numerical summary, I will examine the relationship between two categorical variables.

In [58]:
two_way_table = pd.crosstab(index=students.Gender, columns=students.WtFeel, margins=True, margins_name='Total')
two_way_table

Gender,AboutRt,OverWt,UnderWt,Total
Female,107,32,6,145
Male,56,15,13,84
Total,163,47,19,229


In [63]:
two_way_table.div(two_way_table['Total'], axis=0) * 100

Gender,AboutRt,OverWt,UnderWt,Total
Female,73.793103,22.068966,4.137931,100.0
Male,66.666667,17.857143,15.47619,100.0
Total,71.179039,20.524017,8.296943,100.0
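
The conditional percentages are simply each count divided by its row total. A minimal pure-Python sketch, using the counts from the two-way table above (the helper name `row_percentages` is an assumption for illustration):

```python
# Counts copied from the two-way table above.
counts = {
    "Female": {"AboutRt": 107, "OverWt": 32, "UnderWt": 6},
    "Male": {"AboutRt": 56, "OverWt": 15, "UnderWt": 13},
}

def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result

pct = row_percentages(counts)
for gender, cells in pct.items():
    print(gender, {col: round(p, 1) for col, p in cells.items()})
```

Conditioning on the explanatory variable (Gender) is what makes the two rows comparable even though the sample contains many more females than males.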


#### 3. Conclusion

The results indicate that males and females do differ with respect to their body image. A larger proportion of females felt that their weight was about right (roughly 74% vs. 67% for males). Among the students who did not feel that their weight was about right, there was a difference between males and females. Roughly the same proportion of male students were concerned about being overweight as being underweight (18% vs. 15.5%), while females were mostly concerned about being overweight (22% vs. only 4% for underweight).

It is interesting to find that a larger proportion of males than females are concerned with their body image (roughly 33% vs. 26% did not feel that their weight was about right). Also, it makes sense that being underweight is more of a concern for males than it is for females, given the current quest of many males for a big, muscular body and the quest of many females to emulate thin fashion models.

*It would be useful to report these results to people who counsel students, since concerns about body image can result in poor college performance and health issues.*

### Q3. Is students' academic performance in college related to their typical seating location in class?

#### 1. Reflect on Question
In this first step, we think about the question and use our intuition and/or experience to try and predict what the results will show. Later, we will compare what we initially thought to what we actually find when we analyze the data.

>One would expect that generally, students who sit in the front or middle of the classroom are the more conscientious students and therefore are also students with higher GPAs. This is of course, a gross generalization. It will be interesting to see whether the data will support this.


#### 2. Analyze Data

##### a. Plan Analyses

Before choosing the appropriate analyses, it is helpful to identify the relevant variables, which for this question are:
- GPA
- Seat

The variable **GPA** is the *dependent (response)* variable and is *quantitative*.  
The variable **Seat** is the *independent (explanatory)* variable and is *categorical*.

##### b. Exploratory Analysis 
- A meaningful display for this question is *Side-by-Side Boxplots*.
- A meaningful numerical summary to supplement the above display is *descriptive statistics*.
- Using this display and numerical summary, I will compare the distribution of a quantitative variable across several groups.

*Side-by-Side Boxplot*

In [64]:
students.Seat.unique()

array(['M', 'F', 'B'], dtype=object)

In [75]:
gpa_F = go.Box(y=students.GPA[students.Seat == 'F'], name='Front')
gpa_M = go.Box(y=students.GPA[students.Seat == 'M'], name='Middle')
gpa_B = go.Box(y=students.GPA[students.Seat == 'B'], name='Back')

In [76]:
layout = go.Layout(
    title='GPA by Seating Location',
    yaxis={'title': 'GPA'}
)

In [77]:
iplot(go.Figure(data=[gpa_F, gpa_M, gpa_B], layout=layout))

In [74]:
students.groupby('Seat').GPA.describe()

Seat,count,mean,std,min,25%,50%,75%,max
B,46.0,2.974348,0.493296,2.0,2.6775,3.0,3.2375,4.06
F,51.0,3.251098,0.567237,1.92,3.0,3.33,3.7,4.1
M,131.0,3.118931,0.526408,1.91,2.78,3.0,3.505,4.38


**Reporting Results:**  
When comparing distributions across groups, we need to address the issues of center, spread and outliers:

**Center:** The median GPA of the "F group" (3.33) is higher than that of the two other groups (both of which are 3.0).

**Spread:** Differences in spread between the three groups are not huge, but still exist. Based on the summary table above, the "M group" has the largest spread (range = 2.47, IQR ≈ 0.73), followed by the "F group" (range = 2.18, IQR = 0.70), and then the "B group" (range = 2.06, IQR = 0.56).

**Outliers:** There are no outliers.
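
The range and IQR can be derived directly from the `describe()` output above. The sketch below does so in pure Python with the values copied from that table (the helper name `spread` is hypothetical). Note that quartile conventions differ between tools, so IQRs computed elsewhere may differ slightly from these.

```python
# min, quartiles and max copied from the describe() table above.
summary = {
    "B": {"min": 2.0, "q1": 2.6775, "q3": 3.2375, "max": 4.06},
    "F": {"min": 1.92, "q1": 3.0, "q3": 3.7, "max": 4.1},
    "M": {"min": 1.91, "q1": 2.78, "q3": 3.505, "max": 4.38},
}

def spread(stats):
    """Range (max - min) and interquartile range (Q3 - Q1) for one group."""
    return {"range": stats["max"] - stats["min"], "iqr": stats["q3"] - stats["q1"]}

spreads = {seat: spread(stats) for seat, stats in summary.items()}
for seat, s in spreads.items():
    print(f"{seat}: range = {s['range']:.2f}, IQR = {s['iqr']:.3f}")
```

The IQR is usually the better summary here because, unlike the range, it is not driven by a single unusually high or low GPA.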

#### 3. Conclusion

The data suggest that GPA is somewhat related to seating location. In general, the GPAs of students who sit in the front of the classroom are slightly higher than those of students who sit in the middle or in the back. However, there is a lot of variation in GPA within each of the three groups, and therefore the student's typical seating location should not really be used as an indication of his/her performance in college.


The conclusions support what we would naturally expect. It was interesting to find hardly any differences in GPAs between students who sit in the middle of the classroom and those who sit in the back.

If instructors were made aware of these results, they could encourage students who are doing poorly to change their seat location. Also, students should be made aware that where they sit might impact their performance.