# Chapter py_02b
Statistics for Data Science and Analytics<br>
by Peter C. Bruce, Peter Gedeck, Janet F. Dobbins

Publisher: Wiley; 1st edition (2024) <br>
<!-- ISBN-13: 978-3031075650 -->

(c) 2024 Peter C. Bruce, Peter Gedeck, Janet F. Dobbins

The code needs to be executed in sequence.

Python packages and Python itself change over time. This can cause warnings or errors. 
"Warnings" are for information only and can usually be ignored. 
"Errors" will stop execution and need to be fixed in order to get results. 

If you come across an issue with the code, please follow these steps:

- Check the repository (https://gedeck.github.io/sdsa-code-solutions/) to see if the code has been upgraded. This might solve the problem.
- Report the problem using the issue tracker at https://github.com/gedeck/sdsa-code-solutions/issues.
- Paste the error message into Google and see if someone else has already found a solution.

In [2]:
x = 123
if x < 0:
    print('x is negative')
elif x == 0:
    print('x is zero')
else:
    print('x is positive')

x is positive


In [3]:
x = -123
if x < 0:
    x = -x
print(f'Absolute value of x: {x}')

Absolute value of x: 123
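
The conditional negation above gives the same result as Python's built-in `abs()`. A minimal check, using a few illustrative values that are not part of the original cell:

```python
# Compare the manual if-based approach with the built-in abs()
test_values = [-123, 0, 123]  # illustrative values only
manual_results = []
for x in test_values:
    if x < 0:
        x = -x
    manual_results.append(x)

builtin_results = [abs(x) for x in test_values]
print(manual_results == builtin_results)
```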


In [4]:
x = 123
if x < 0:
    print('x is negative')
else:
    if x == 0:
        print('x is zero')
    else:
        print('x is positive')

x is positive


In [5]:
for x in [1, 2, 3, 4, 5]:
    print(x)

1
2
3
4
5


In [6]:
x = 1
while x <= 5:
    print(x)
    x += 1

1
2
3
4
5


In [7]:
for x in range(1, 11):
    if x % 2 == 0:
        continue
    print(x)

1
3
5
7
9
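
The `continue` statement skips the rest of the loop body for even numbers. As a cross-check, the same odd numbers can likely be produced more directly with the optional step argument of `range`. The variable names below are chosen for illustration:

```python
# Collect odd numbers by skipping even ones with continue
odd_with_continue = []
for x in range(1, 11):
    if x % 2 == 0:
        continue  # jump straight to the next iteration
    odd_with_continue.append(x)

# Same numbers using a step of 2, starting at 1
odd_with_step = list(range(1, 11, 2))

print(odd_with_continue == odd_with_step)
```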


In [8]:
for x in [1, 2, 3, 4, 5]:
    if x == 3:
        break
    print(x)

1
2


In [9]:
numbers = [12, 8, 9, 10, 11, 13, 9, 11, 10, 12]
sum_of_numbers = 0
for x in numbers:
    sum_of_numbers += x
mean = sum_of_numbers / len(numbers)
print(f'Mean: {mean}')

Mean: 10.5


In [10]:
variance = 0
for x in numbers:
    variance += (x - mean) ** 2
variance /= len(numbers)  # population variance: divide by n, not n - 1
sd = variance ** 0.5
print(f'Variance: {variance}')
print(f'Standard deviation: {sd}')

Variance: 2.25
Standard deviation: 1.5
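
The loop above divides by `len(numbers)`, so it computes the population variance. As a sanity check, the standard library's `statistics` module should give the same values through `pvariance` and `pstdev`. Its `variance` function divides by n - 1 instead and therefore gives a larger value. This is a sketch for comparison, not part of the original notebook:

```python
import statistics

numbers = [12, 8, 9, 10, 11, 13, 9, 11, 10, 12]

pop_variance = statistics.pvariance(numbers)    # divides by n
pop_sd = statistics.pstdev(numbers)             # square root of the above
sample_variance = statistics.variance(numbers)  # divides by n - 1

print(f'Population variance: {pop_variance}')
print(f'Population standard deviation: {pop_sd}')
print(f'Sample variance: {sample_variance}')
```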


In [11]:
greater_than_mean = []
for x in numbers:
    if x > mean:
        greater_than_mean.append(x)
print(f'Numbers greater than mean: {greater_than_mean}')

Numbers greater than mean: [12, 11, 13, 11, 12]


In [12]:
squared_differences = [(x - mean) ** 2 for x in numbers]
variance = sum(squared_differences) / len(numbers)
print(f'Variance: {variance}, standard deviation: {variance ** 0.5}')

Variance: 2.25, standard deviation: 1.5


In [13]:
variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)

In [14]:
greater_than_mean = [x for x in numbers if x > mean]
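
The two cells above produce no output. The following self-contained sketch repeats them with the data from the earlier cells and prints the results, so they can be compared with the loop-based versions:

```python
numbers = [12, 8, 9, 10, 11, 13, 9, 11, 10, 12]
mean = sum(numbers) / len(numbers)

# Generator expression: values are summed without building a list first
variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)

# List comprehension with a filter condition
greater_than_mean = [x for x in numbers if x > mean]

print(f'Variance: {variance}')
print(f'Numbers greater than mean: {greater_than_mean}')
```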

In [15]:
import numpy as np
numbers = list(range(1_000_000))
numbers_np = np.arange(1_000_000)

In [16]:
%%timeit -n 1 -r 5
sum_of_numbers = 0
for x in numbers:
    sum_of_numbers += x

12.3 ms ± 400 µs per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [17]:
%%timeit -n 1 -r 5
sum_of_numbers = sum(numbers)

3.13 ms ± 104 µs per loop (mean ± std. dev. of 5 runs, 1 loop each)


In [18]:
%%timeit -n 10 -r 5
sum_of_numbers = numbers_np.sum()

78.1 µs ± 7.06 µs per loop (mean ± std. dev. of 5 runs, 10 loops each)
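
`%%timeit` is an IPython cell magic and only works in Jupyter or IPython. Outside a notebook, a comparable measurement can be sketched with the standard library's `timeit` module. The helper function `loop_sum` is introduced here for illustration, and the sketch reports the best of five runs rather than the mean and standard deviation shown above. Timings depend on the machine, so no expected output is given:

```python
import timeit

numbers = list(range(1_000_000))

def loop_sum(values):
    """Sum the values with an explicit for loop."""
    total = 0
    for x in values:
        total += x
    return total

# repeat=5, number=1 mirrors the "-n 1 -r 5" options used above
loop_times = timeit.repeat(lambda: loop_sum(numbers), repeat=5, number=1)
builtin_times = timeit.repeat(lambda: sum(numbers), repeat=5, number=1)

print(f'for loop:       best of 5 = {min(loop_times) * 1000:.2f} ms')
print(f'built-in sum(): best of 5 = {min(builtin_times) * 1000:.2f} ms')
```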


In [19]:
import numpy as np
numbers = [1, 2, 3, 4, 5]
x = np.array(numbers)
print(x)

[1 2 3 4 5]


In [20]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([6, 7, 8, 9, 10])
print(x + y)
print(x * y)
print(np.sqrt((x - 3) ** 2))
print(x > 3)
print(x[x > 3])

[ 7  9 11 13 15]
[ 6 14 24 36 50]
[2. 1. 0. 1. 2.]
[False False False  True  True]
[4 5]
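
The element-wise operations and boolean indexing shown above are enough to redo the earlier mean, variance, and filtering loops without an explicit `for`. A minimal sketch using the numbers from the earlier cells. `np.var` uses the population formula (division by n) unless `ddof` is changed:

```python
import numpy as np

numbers = np.array([12, 8, 9, 10, 11, 13, 9, 11, 10, 12])

mean = numbers.mean()
variance = ((numbers - mean) ** 2).mean()  # element-wise, then averaged
sd = np.sqrt(variance)
greater_than_mean = numbers[numbers > mean]  # boolean mask selects values

print(f'Mean: {mean}, variance: {variance}, standard deviation: {sd}')
print(f'np.var gives: {np.var(numbers)}')
print(f'Numbers greater than mean: {greater_than_mean}')
```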


In [21]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
print(x.sum())
print(x.mean())
print(x.cumsum())

15
3.0
[ 1  3  6 10 15]


In [22]:
import pandas as pd
df = pd.read_csv("hospitalerrors.csv")
df.head()

   Control  Treatment
0        1          2
1        1          2
2        1          2
3        1          2
4        1          2


In [23]:
df.to_csv("data.csv", index=False)
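
The `index=False` argument keeps the row index out of the file. A sketch of what happens otherwise, using a small made-up data frame (the values are hypothetical, not the book's data) and an in-memory round trip instead of a file on disk:

```python
import io
import pandas as pd

# Small illustrative data frame (hypothetical values)
df_small = pd.DataFrame({"Control": [1, 1, 3], "Treatment": [2, 2, 4]})

# to_csv() without a path returns the CSV text as a string
with_index = pd.read_csv(io.StringIO(df_small.to_csv()))
without_index = pd.read_csv(io.StringIO(df_small.to_csv(index=False)))

# The saved index comes back as an extra column named "Unnamed: 0"
print(list(with_index.columns))
print(list(without_index.columns))
```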

In [24]:
# Accessing a column
control = df["Control"]
control = df.Control
# Accessing using row index and column names
control = df.loc[:, "Control"]
row = df.loc[0]  # or df.loc[0, :]
values = df.loc[4:10, "Treatment"]
value = df.loc[0, "Control"]
# Accessing data using row and column numbers
treatment = df.iloc[:, 1]
row = df.iloc[0, :]
value = df.iloc[10, 0]
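
One difference between the two accessors is easy to miss: a slice in `.loc` is label-based and includes the end label, while a slice in `.iloc` is position-based and excludes the end position. A sketch with a made-up data frame (the column values are hypothetical):

```python
import pandas as pd

# Hypothetical data with the default integer index 0..11
df_demo = pd.DataFrame({"Control": range(12), "Treatment": range(100, 112)})

by_label = df_demo.loc[4:10, "Treatment"]  # labels 4 through 10, inclusive
by_position = df_demo.iloc[4:10, 1]        # positions 4 through 9

print(len(by_label), len(by_position))
```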

In [25]:
# Adding or changing columns
df["Constant value"] = 1
df["Sequence"] = range(len(df))
df["NewColumn"] = df["Control"] + df["Treatment"]
df["NewColumn"] = df["NewColumn"] * 2
# Removing a column
df = df.drop(columns=["NewColumn"])
# Renaming columns
df = df.rename(columns={"Control": "ControlGroup", "Treatment": "TreatmentGroup"})
# Sorting the data
df = df.sort_values(by="ControlGroup")
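
After sorting, the rows keep their original index labels, so `.loc[0]` and `.iloc[0]` may refer to different rows. A sketch with hypothetical values. `reset_index(drop=True)` is one way to renumber the rows if that is wanted:

```python
import pandas as pd

# Hypothetical data, not the book's data set
df_demo = pd.DataFrame({"ControlGroup": [3, 1, 2], "TreatmentGroup": [5, 4, 6]})
df_sorted = df_demo.sort_values(by="ControlGroup")

print(df_sorted.index.tolist())           # original labels, now reordered
print(df_sorted.loc[0, "ControlGroup"])   # row with label 0
print(df_sorted.iloc[0, 0])               # first row after sorting

# Renumber the rows 0, 1, 2, ... and discard the old labels
df_reset = df_sorted.reset_index(drop=True)
print(df_reset.index.tolist())
```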