# Problem Statement 1:
You were recently hired as a business analyst in a top sports company. The senior management team has
asked you to come up with metrics with which they can gauge which team will win the upcoming La Liga Cup
(Football tournament). The given data set contains information on all the teams that have so far participated
in all the past tournaments. It has data about how many goals each team scored, conceded; how many times
they came within the first 6 positions, how many seasons they have qualified, their best position in the past,
etc.
Before doing any analysis it would be a good idea to check for any hyphens or other symbols in the data set
and make appropriate replacements to make sure you can perform arithmetic operations on the data. Prepare
a short report to answer the following questions:

1. Which are the teams which started playing between 1930 and 1980?
2. Which are the top 5 teams in terms of points?
3. What is the distribution of the winning percentage for all teams? Which teams are in the top 5 in
   terms of winning percentage? (Winning percentage= (GamesWon / GamesPlayed)\*100)
4. Is there a significant difference in the winning percentage for teams which have attained the best
   position between 1-3 and those teams which have had the best position between 4-7?


In [2]:
# importing libraries
import numpy as np
import pandas as pd

# Load the dataset into a dataframe named 'df'
df = pd.read_csv('laliga.csv')
df.head(5)

Unnamed: 0,Pos,Team,Seasons,Points,GamesPlayed,GamesWon,GamesDrawn,GamesLost,GoalsFor,GoalsAgainst,Champion,Runner-up,Third,Fourth,Fifth,Sixth,T,Debut,Since/LastApp,BestPosition
0,1,Real Madrid,86,4385,2762,1647,552,563,5947,3140,33,23,8,8,3,4,79,1929,1929,1
1,2,Barcelona,86,4262,2762,1581,573,608,5900,3114,25,25,12,12,4,6,83,1929,1929,1
2,3,Atletico Madrid,80,3442,2614,1241,598,775,4534,3309,10,8,16,9,7,6,56,1929,2002-03,1
3,4,Valencia,82,3386,2664,1187,616,861,4398,3469,6,6,10,11,10,7,50,1931-32,1987-88,1
4,5,Athletic Bilbao,86,3368,2762,1209,633,920,4631,3700,8,7,10,5,8,10,49,1929,1929,1


In [3]:
# Check the shape (rows, columns) of the dataframe 'df'
df.shape

(61, 20)

In [4]:
#checking data types of all columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Pos            61 non-null     int64 
 1   Team           61 non-null     object
 2   Seasons        61 non-null     int64 
 3   Points         61 non-null     object
 4   GamesPlayed    61 non-null     object
 5   GamesWon       61 non-null     object
 6   GamesDrawn     61 non-null     object
 7   GamesLost      61 non-null     object
 8   GoalsFor       61 non-null     object
 9   GoalsAgainst   61 non-null     object
 10  Champion       61 non-null     object
 11  Runner-up      61 non-null     object
 12  Third          61 non-null     object
 13  Fourth         61 non-null     object
 14  Fifth          61 non-null     object
 15  Sixth          61 non-null     object
 16  T              61 non-null     object
 17  Debut          61 non-null     object
 18  Since/LastApp  61 non-null     o

In [5]:
# Replace the '-' placeholder with 0 so that arithmetic operations can be performed on these columns
df.replace('-',0,inplace=True)
df

Unnamed: 0,Pos,Team,Seasons,Points,GamesPlayed,GamesWon,GamesDrawn,GamesLost,GoalsFor,GoalsAgainst,Champion,Runner-up,Third,Fourth,Fifth,Sixth,T,Debut,Since/LastApp,BestPosition
0,1,Real Madrid,86,4385,2762,1647,552,563,5947,3140,33,23,8,8,3,4,79,1929,1929,1
1,2,Barcelona,86,4262,2762,1581,573,608,5900,3114,25,25,12,12,4,6,83,1929,1929,1
2,3,Atletico Madrid,80,3442,2614,1241,598,775,4534,3309,10,8,16,9,7,6,56,1929,2002-03,1
3,4,Valencia,82,3386,2664,1187,616,861,4398,3469,6,6,10,11,10,7,50,1931-32,1987-88,1
4,5,Athletic Bilbao,86,3368,2762,1209,633,920,4631,3700,8,7,10,5,8,10,49,1929,1929,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,57,Xerez,1,34,38,8,10,20,38,66,0,0,0,0,0,0,0,2009-10,2009-10,20
57,58,Condal,1,22,30,7,8,15,37,57,0,0,0,0,0,0,0,1956-57,1956-57,16
58,59,Atletico Tetuan,1,19,30,7,5,18,51,85,0,0,0,0,0,0,0,1951-52,1951-52,16
59,60,Cultural Leonesa,1,14,30,5,4,21,34,65,0,0,0,0,0,0,0,1955-56,1955-56,15
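
Replacing '-' with 0 still leaves the affected columns with the `object` dtype, which is why later cells call `pd.to_numeric` column by column. As a hedged alternative, the sketch below converts several count columns in one pass and marks the placeholder as missing rather than zero. The miniature table `sample` and its values are invented for illustration; only the column names come from the data set.

```python
import pandas as pd

# Hypothetical miniature of the La Liga table; the values are invented for illustration.
sample = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "Points": ["120", "-", "45"],
    "GamesPlayed": ["90", "-", "38"],
    "Debut": ["1929", "1931-32", "2009-10"],
})

# Columns that should hold counts; season labels such as '1931-32' stay as text.
count_columns = ["Points", "GamesPlayed"]

# errors="coerce" turns anything unparseable (here the '-' placeholder) into NaN.
sample[count_columns] = sample[count_columns].apply(pd.to_numeric, errors="coerce")

print(sample.dtypes)
print(sample)
```

Whether a missing value or a zero is the better stand-in depends on the question: NaN keeps a team with no recorded games out of averages, while 0 keeps it in sums.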


1. Which are the teams which started playing between 1930 and 1980?


In [6]:
# converting values of 'Debut' column into string datatype
df['Debut'] = df['Debut'].astype(str)

# Select teams whose debut year lies between 1930 and 1980 into the new dataframe 'Debut_Year'
# (Series.between is inclusive at both ends by default, so 1930 and 1980 are both included)
Debut_Year = df[df['Debut'].str[:4].between('1930','1980')]

# printing team name and debut year from 'Debut_Year' dataframe
Debut_Year[['Team','Debut']].sort_values('Debut')

Unnamed: 0,Team,Debut
28,Alaves,1930-31
3,Valencia,1931-32
9,Real Betis,1932-33
17,Oviedo,1933-34
5,Sevilla,1934-35
25,Hercules,1935-36
15,Osasuna,1935-36
8,Zaragoza,1939-40
11,Celta Vigo,1939-40
27,Murcia,1940-41
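
The filter above compares four-character year strings, which works because all years have the same width. A minimal sketch of the same idea on integers follows, assuming every `Debut` value starts with a four-digit year. The series `debut` is invented for illustration, and the string form of the `inclusive` argument assumes pandas 1.3 or later.

```python
import pandas as pd

# Invented debut labels in the same formats the data set uses ('1929' or '1931-32').
debut = pd.Series(["1929", "1930-31", "1956-57", "1979-80", "1980-81", "2009-10"])

# Take the starting year of each season label and convert it to an integer.
start_year = debut.str[:4].astype(int)

# Default behaviour: both endpoints are included.
mask_inclusive = start_year.between(1930, 1980)

# Variant that includes 1930 but excludes 1980.
mask_left = start_year.between(1930, 1980, inclusive="left")

print(debut[mask_inclusive].tolist())
print(debut[mask_left].tolist())
```

The two masks differ only for teams whose first season started in 1980, so the choice matters only if the question is read as excluding that year.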


2. Which are the top 5 teams in terms of points?


In [7]:
# Convert Points column to numeric
df['Points'] = pd.to_numeric(df['Points'])

# Sort the dataframe by Points in descending order and select the top 5 teams
top_5_teams = df.sort_values(by='Points', ascending=False).head(5)

# Display the top 5 teams
top_5_teams[['Team', 'Points']]

Unnamed: 0,Team,Points
0,Real Madrid,4385
1,Barcelona,4262
2,Atletico Madrid,3442
3,Valencia,3386
4,Athletic Bilbao,3368


3. What is the distribution of the winning percentage for all teams? Which teams are in the top 5 in
   terms of winning percentage? (Winning percentage= (GamesWon / GamesPlayed)\*100)


In [8]:
# Convert GamesPlayed and GamesWon columns to numeric
df['GamesPlayed'] = pd.to_numeric(df['GamesPlayed'])
df['GamesWon'] = pd.to_numeric(df['GamesWon'])

# Calculate the winning percentage
df['WinningPercentage'] = (df['GamesWon'] / df['GamesPlayed']) * 100

# Display the distribution of the winning percentage
winning_percentage_distribution = df['WinningPercentage'].describe()
print("winning_percentage_distribution:  " + str(winning_percentage_distribution))

# Find the top 5 teams in terms of winning percentage
top_5_winning_percentage_teams = df.sort_values(by='WinningPercentage', ascending=False).head(5)
top_5_winning_percentage_teams[['Team', 'WinningPercentage']]

winning_percentage_distribution:  count    60.000000
mean     31.364790
std       7.831199
min      16.666667
25%      27.607494
50%      30.491722
75%      33.540164
max      59.630702
Name: WinningPercentage, dtype: float64


Unnamed: 0,Team,WinningPercentage
0,Real Madrid,59.630702
1,Barcelona,57.24113
2,Atletico Madrid,47.475134
3,Valencia,44.557057
4,Athletic Bilbao,43.772629
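
`describe()` summarises the distribution with quartiles. To see how many teams fall into each range, the values can also be binned. The sketch below uses `pd.cut` with 10-point bins on a short list of illustrative percentages (not the full data set). The bin edges are an assumption and can be adjusted.

```python
import numpy as np
import pandas as pd

# Illustrative winning percentages only; replace with df['WinningPercentage'].dropna().
win_pct = pd.Series([59.6, 57.2, 47.5, 44.6, 43.8, 33.5, 30.5, 27.6, 16.7])

# Bin edges at 10, 20, ..., 60; each bin is open on the left and closed on the right.
bins = np.arange(10, 70, 10)

# Count the values per bin and keep the bins in ascending order.
counts = pd.cut(win_pct, bins=bins).value_counts().sort_index()
print(counts)
```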


4. Is there a significant difference in the winning percentage for teams which have attained the best
   position between 1-3 and those teams which have had the best position between 4-7?


In [9]:
from scipy.stats import ttest_ind

# Split the dataframe into two groups based on BestPosition
group_1_3 = df[df['BestPosition'].between(1, 3)]
group_4_7 = df[df['BestPosition'].between(4, 7)]

# Perform t-test
t_stat, p_value = ttest_ind(group_1_3['WinningPercentage'].dropna(), group_4_7['WinningPercentage'].dropna())

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Check if the p-value is less than 0.05 to determine significance
if p_value < 0.05:
    print("There is a significant difference in the winning percentage between the two groups.")
else:
    print("There is no significant difference in the winning percentage between the two groups.")

T-statistic: 4.992715339447283
P-value: 1.5362101870160497e-05
There is a significant difference in the winning percentage between the two groups.
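
`ttest_ind` assumes equal variances in the two groups by default. As a hedged robustness check, the same comparison can be run as Welch's t-test (`equal_var=False`), which drops that assumption. The two lists below are invented stand-ins for the winning percentages of the two groups, so the printed numbers are not results for the La Liga data.

```python
from scipy.stats import ttest_ind

# Invented winning percentages standing in for the two BestPosition groups.
group_a = [59.6, 57.2, 47.5, 44.6, 43.8, 40.1]
group_b = [35.2, 33.1, 31.0, 30.4, 29.8, 28.5, 27.9]

# Student's t-test: assumes both groups share the same variance (the default).
t_pooled, p_pooled = ttest_ind(group_a, group_b)

# Welch's t-test: allows the group variances to differ.
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print(f"Pooled: t = {t_pooled:.3f}, p = {p_pooled:.5f}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.5f}")
```

If both versions lead to the same decision at the 0.05 level, the conclusion does not hinge on the equal-variance assumption.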


# Problem Statement 2:
A study was done to measure the blood pressure of 60-year-old women with glaucoma. A random sample of
200 60-year-old women with glaucoma was chosen. The mean of the systolic blood pressure in the sample
was 140 mm Hg and the standard deviation was 25 mm Hg. Prepare a short report to answer the following
questions:

1. Calculate the estimated standard error of the sample mean. What does the standard error indicate?
2. Estimate a 95% confidence interval for the true mean blood pressure for all 60-year-old women with
   glaucoma.
3. Assume that instead of 200, a random sample of only 100 60-year-old women with glaucoma was
   chosen. The sample mean and standard deviation estimates are the same as those in the original
   study. What is the estimated 95% confidence interval for the true mean blood pressure?
4. Which of the two above intervals is wider?
5. Explain in non-technical terms why the estimated standard error of a sample mean tends to decrease
   with an increase in sample size.


1. Calculate the estimated standard error of the sample mean. What does the standard error indicate?


In [10]:
import numpy as np

# Given values
sample_size = 200
sample_mean = 140
std_deviation = 25

# Calculate the standard error
standard_error = std_deviation / np.sqrt(sample_size)
print(f"Estimated Standard Error: {standard_error}")

# Explanation of what the standard error indicates
explanation = """
The standard error of the sample mean estimates how far a sample mean (here 140 mm Hg) would typically fall from the true population mean if the study were repeated with new samples of the same size.
A smaller standard error suggests that the sample mean is a more precise estimate of the population mean.
"""
print(explanation)

Estimated Standard Error: 1.7677669529663687

The standard error of the sample mean estimates how far a sample mean (here 140 mm Hg) would typically fall from the true population mean if the study were repeated with new samples of the same size.
A smaller standard error suggests that the sample mean is a more precise estimate of the population mean.



2. Estimate a 95% confidence interval for the true mean blood pressure for all 60-year-old women with
glaucoma.

In [11]:
from scipy.stats import norm

# Calculate the Z score for a 95% confidence interval
z_score = norm.ppf(0.975)

# Calculate the margin of error
margin_of_error = z_score * standard_error

# Calculate the confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)  # (lower bound, upper bound)
print(f"95% Confidence Interval: {confidence_interval}")

95% Confidence Interval: (136.5352404391258, 143.4647595608742)


3. Assume that instead of 200, a random sample of only 100 60-year-old women with glaucoma was
chosen. The sample mean and standard deviation estimates are the same as those in the original
study. What is the estimated 95% confidence interval for the true mean blood pressure?

In [12]:
# New sample size
new_sample_size = 100

# Recalculate the standard error with the new sample size
new_standard_error = std_deviation / np.sqrt(new_sample_size)

# Calculate the margin of error with the new standard error
new_margin_of_error = z_score * new_standard_error

# Calculate the new confidence interval
new_confidence_interval = (sample_mean - new_margin_of_error, sample_mean + new_margin_of_error)
print(f"95% Confidence Interval with sample size of 100: {new_confidence_interval}")

95% Confidence Interval with sample size of 100: (135.10009003864985, 144.89990996135015)


4. Which of the two above intervals is wider?

In [13]:
# Calculate the width of the original confidence interval
original_interval_width = confidence_interval[1] - confidence_interval[0]

# Calculate the width of the new confidence interval
new_interval_width = new_confidence_interval[1] - new_confidence_interval[0]

# Compare the widths
if original_interval_width > new_interval_width:
    print("The original confidence interval is wider.")
else:
    print("The new confidence interval is wider.")

The new confidence interval is wider.


5. Explain in non-technical terms why the estimated standard error of a sample mean tends to decrease
with an increase in sample size.
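
In plain terms, every individual measurement is a little off from the true average. When many measurements are averaged, the high and low ones tend to cancel out, so the average of a larger sample is likely to land closer to the truth. The sketch below illustrates this with the formula used earlier, standard error = standard deviation / sqrt(sample size). The sample sizes other than 100 and 200 are chosen only for illustration.

```python
import math

std_deviation = 25  # sample standard deviation from the study, in mm Hg

# Standard error for several sample sizes; 400 and 800 are illustrative additions.
standard_errors = {}
for n in (100, 200, 400, 800):
    standard_errors[n] = std_deviation / math.sqrt(n)
    print(f"n = {n:4d}  standard error = {standard_errors[n]:.4f}")
```

Because of the square root, quadrupling the sample size halves the standard error. This is also why the interval for 100 women is wider than the one for 200.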

Note on the Z score used in the intervals above (general form: z_score = (value - mean) / std_dev):
The Z score of 1.96 is commonly used in statistics for a 95% confidence interval. This value comes from the standard normal distribution (a normal distribution with a mean of 0 and a standard deviation of 1).

To find the Z score for a 95% confidence interval, you look for the value that leaves 2.5% in each tail of the distribution (since 95% is in the middle). This means you need to find the Z score that corresponds to the 97.5th percentile of the standard normal distribution.

In [14]:
from scipy.stats import norm

confidence_level = 0.95  # For a 95% confidence interval
alpha = 1 - confidence_level  # Significance level
z_score = norm.ppf(1 - alpha / 2)  # Two-tailed: 2.5% in each tail, i.e. the 97.5th percentile
print(f"{z_score}")

1.959963984540054


# Problem Statement 3:
Par Inc. is a major manufacturer of golf equipment. Management believes that Par’s
market share could be increased with the introduction of a cut-resistant, longer-lasting golf ball. Therefore,
the research group at Par has been investigating a new golf ball coating designed to resist cuts and provide a
more durable ball. The tests with the coating have been promising.
One of the researchers voiced concern about the effect of the new coating on driving distances. Par would like
the new cut-resistant ball to offer driving distances comparable to those of the current-model golf ball. To
compare the driving distances for the two balls, 40 balls of both the new and current models were subjected
to distance tests. The testing was performed with a mechanical hitting machine so that any difference
between the mean distances for the two models could be attributed to a difference in the design. The results
of the tests, with distances measured to the nearest yard, are contained in the data set “Golf”. Prepare a short
report to answer the following questions:

1. Formulate and present the rationale for a hypothesis test that Par could use to compare the driving
   distances of the current and new golf balls.
2. Analyze the data to provide the hypothesis testing conclusion. What is the p-value for your test?
   What is your recommendation for Par Inc.?
3. What is the 95% confidence interval for the population mean of each model, and what is the 95%
   confidence interval for the difference between the means of the two populations?


In [15]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, norm

# Load the dataset into a dataframe named 'golf_df'
golf_df = pd.read_csv('Golf.csv')

# Display the first few rows of the dataframe
golf_df.head()

Unnamed: 0,Current,New
0,264,277
1,261,269
2,267,263
3,272,266
4,258,262


In [16]:
from scipy.stats import ttest_ind

# Extract the driving distances for the current and new golf balls
current_distances = golf_df['Current']
new_distances = golf_df['New']

# Perform a two-sample t-test
t_stat, p_value = ttest_ind(current_distances, new_distances)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Check if the p-value is less than the significance level (0.05)
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in the mean driving distances between the current and new golf balls.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in the mean driving distances between the current and new golf balls.")

T-statistic: 1.3283615935245678
P-value: 0.18793228491854666
Fail to reject the null hypothesis: There is no significant difference in the mean driving distances between the current and new golf balls.


In [17]:
import pandas as pd
import scipy.stats as stats
import numpy as np

# Use the existing dataframe 'golf_df'
df = golf_df

# Use the existing series 'current_distances' and 'new_distances'
current_model = current_distances
new_model = new_distances

### 1. Hypothesis Formulation
# H0 (Null Hypothesis): The mean driving distances are equal (mean_new = mean_current)
# H1 (Alternative Hypothesis): The mean driving distances differ (mean_new ≠ mean_current)

### 2. Perform Independent t-test
t_stat, p_value = stats.ttest_ind(new_model, current_model, equal_var=False)

# Decision Rule: If p-value < 0.05, reject H0 (significant difference)
if p_value < 0.05:
    recommendation = "Reject H0: There is a significant difference in driving distance."
else:
    recommendation = "Fail to reject H0: No significant difference in driving distance."

### 3. Compute 95% Confidence Intervals
confidence = 0.95
alpha = 1 - confidence

# Confidence interval for each model
current_mean, new_mean = np.mean(current_model), np.mean(new_model)
current_se, new_se = stats.sem(current_model), stats.sem(new_model)

current_ci = stats.t.interval(confidence, len(current_model)-1, loc=current_mean, scale=current_se)
new_ci = stats.t.interval(confidence, len(new_model)-1, loc=new_mean, scale=new_se)

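# Confidence interval for the difference in means
# Note: with equal sample sizes (40 balls per model) this standard error equals the pooled one,
# so n1 + n2 - 2 degrees of freedom give the equal-variance interval; a Welch interval
# (matching equal_var=False above) would use Welch-Satterthwaite degrees of freedom instead.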
mean_diff = new_mean - current_mean
se_diff = np.sqrt(current_se**2 + new_se**2)
diff_ci = stats.t.interval(confidence, len(current_model)+len(new_model)-2, loc=mean_diff, scale=se_diff)

### 4. Print Results
print(f"1. Hypothesis Test Results:")
print(f"   - t-statistic: {t_stat:.4f}")
print(f"   - p-value: {p_value:.4f}")
print(f"   - Conclusion: {recommendation}\n")

print(f"2. 95% Confidence Intervals:")
print(f"   - Current Model Mean CI: {current_ci}")
print(f"   - New Model Mean CI: {new_ci}")
print(f"   - Difference in Means CI: {diff_ci}")

1. Hypothesis Test Results:
   - t-statistic: -1.3284
   - p-value: 0.1880
   - Conclusion: Fail to reject H0: No significant difference in driving distance.

2. 95% Confidence Intervals:
   - Current Model Mean CI: (267.4756596416507, 273.0743403583493)
   - New Model Mean CI: (264.3348163973985, 270.6651836026015)
   - Difference in Means CI: (-6.933958406267855, 1.383958406267901)
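
The t-test in this cell uses `equal_var=False` (Welch), while the interval for the difference uses n1 + n2 - 2 degrees of freedom. A minimal sketch of an interval that matches the Welch test follows, with degrees of freedom from the Welch-Satterthwaite formula. It uses only the five rows shown by `golf_df.head()`, so the numbers are illustrative and not the result for the full 40-ball samples. The names `current`, `new` and `welch_df` are hypothetical.

```python
import numpy as np
from scipy import stats

# The five rows displayed by golf_df.head(); swap in the full columns for a real analysis.
current = np.array([264, 261, 267, 272, 258], dtype=float)
new = np.array([277, 269, 263, 266, 262], dtype=float)

# Squared standard error of each sample mean (sample variance divided by sample size).
v_current = current.var(ddof=1) / len(current)
v_new = new.var(ddof=1) / len(new)

# Difference in means and its unpooled standard error.
mean_diff = new.mean() - current.mean()
se_diff = np.sqrt(v_current + v_new)

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df = (v_current + v_new) ** 2 / (
    v_current ** 2 / (len(current) - 1) + v_new ** 2 / (len(new) - 1)
)

# 95% confidence interval for the difference in means under Welch's approach.
diff_ci = stats.t.interval(0.95, welch_df, loc=mean_diff, scale=se_diff)
print(f"Welch degrees of freedom: {welch_df:.2f}")
print(f"95% CI for the difference in means: {diff_ci}")
```

With equal sample sizes and similar variances, the Welch degrees of freedom sit close to n1 + n2 - 2, so the two intervals usually differ only slightly.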
