# Exercise 4: Mathematical and statistical programming with NumPy

### Given:
```
# import necessary libraries 
import numpy as np
# create data to work with 
x = np.arange(10)
```
1. Display the first five elements
2. Display every other element
3. Display the elements from index 5 in reverse order

In [3]:
import numpy as np

# create data to work with
x = np.arange(10)

# 1. Display the first five elements
print("First five elements:", x[:5])

# 2. Display every other element
print("Every other element:", x[::2])

# 3. Display the elements from index 5 in reverse order
print("Elements from index 5 in reverse order:", x[5::-1])


First five elements: [0 1 2 3 4]
Every other element: [0 2 4 6 8]
Elements from index 5 in reverse order: [5 4 3 2 1 0]


### Here's a brief explanation of each action:

- Display the first five elements:
This is achieved by slicing the array x using ```x[:5]```, which returns the elements from index 0 to index 4.
- Display every other element:
Using slicing with a step of 2, ```x[::2]``` returns every other element in the array.
- Display the elements from index 5 in reverse order:
This is done by slicing from index 5 backwards to the beginning of the array, using ```x[5::-1]```.
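
Task 3 can be read in two ways, so it may help to compare both slices side by side. The sketch below reuses `x` from the exercise; the names `down_from_5`, `tail_reversed`, and `tail_reversed_single` are illustrative only and are not part of the exercise.

```python
import numpy as np

x = np.arange(10)

# Reading A (used in the solution above): start at index 5 and step backwards to index 0
down_from_5 = x[5::-1]

# Reading B: take the elements from index 5 to the end, then reverse them
tail_reversed = x[5:][::-1]

# Reading B as a single slice: start at the last element, stop before index 4
tail_reversed_single = x[:4:-1]

print(down_from_5)           # [5 4 3 2 1 0]
print(tail_reversed)         # [9 8 7 6 5]
print(tail_reversed_single)  # [9 8 7 6 5]
```

With a negative step, an omitted start defaults to the last element and an omitted stop runs through index 0, which is why `x[5::-1]` ends at the beginning of the array.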

### Two-dimensional array
```y = np.array([[12, 5, 2, 4],[ 7, 6, 8, 8],[ 1, 6, 7, 7]])```

4. Display the first 2 rows and the first 3 columns.
5. Display the first column of y.
6. Display the first row of y.
7. Read the contents of file cdc_1.csv, containing heights, weights and ages, into array data.
8. Calculate the minimum, maximum, and mean height, weight, and age.

In [34]:
import numpy as np

# Define the array y
y = np.array([[12, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]])

# 4. Display the first 2 rows and the first 3 columns
print("First 2 rows and first 3 columns:")
print(y[:2, :3])

# 5. Display the first column of y
print("First column of y:")
print(y[:, 0])

# 6. Display the first row of y
print("First row of y:")
print(y[0, :])

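# 7. Read the contents of file cdc_1.csv, containing heights, weights, and ages, into array data
# Note: the path below names cdc.csv rather than cdc_1.csv. According to the genfromtxt
# section, columns 5, 6 and 7 of that file are height, weight and wtdesire, so the
# array called `ages` below receives desired weights, not ages.
data = np.genfromtxt(r'C:\Users\mike\QADHPython\cdc.csv', delimiter=',', skip_header=1, usecols=(5, 6, 7))

# 8. Calculate the minimum, maximum, and mean height, weight, and age
heights = data[:, 0]
weights = data[:, 1]
ages = data[:, 2]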

min_height = np.min(heights)
max_height = np.max(heights)
mean_height = np.mean(heights)

min_weight = np.min(weights)
max_weight = np.max(weights)
mean_weight = np.mean(weights)

min_age = np.min(ages)
max_age = np.max(ages)
mean_age = np.mean(ages)

print("Minimum height:", min_height)
print("Maximum height:", max_height)
print("Mean height:", mean_height)

print("Minimum weight:", min_weight)
print("Maximum weight:", max_weight)
print("Mean weight:", mean_weight)

print("Minimum age:", min_age)
print("Maximum age:", max_age)
print("Mean age:", mean_age)


First 2 rows and first 3 columns:
[[12  5  2]
 [ 7  6  8]]
First column of y:
[12  7  1]
First row of y:
[12  5  2  4]
Minimum height: 48.0
Maximum height: 93.0
Mean height: 67.1829
Minimum weight: 68.0
Maximum weight: 500.0
Mean weight: 169.68295
Minimum age: 68.0
Maximum age: 680.0
Mean age: 155.09385
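
The nine separate calls above can also be written as three reductions along `axis=0`, which returns one value per column. This is only a sketch: `sample` is a small hypothetical stand-in for `data`, and its values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for `data`: one row per person, one column per measurement
sample = np.array([
    [70.0, 175.0, 160.0],
    [64.0, 125.0, 115.0],
    [60.0, 105.0, 105.0],
])

# axis=0 collapses the rows, leaving one result per column
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)
col_mean = sample.mean(axis=0)

print("Minimum per column:", col_min)
print("Maximum per column:", col_max)
print("Mean per column:", col_mean)
```

Each result has one entry per column, in the same order as the columns passed to `usecols`.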


# On genfromtxt
```numpy.genfromtxt()``` is a versatile NumPy function for reading data from text files, including CSV files. Here's how you can use it to read the `height`, `weight`, and `wtdesire` columns into NumPy arrays.

Assuming your CSV file is named `health_data.csv`, is located in the current working directory, and uses commas as delimiters, here's an example:

```python
import numpy as np

# Load the data from the CSV file
data = np.genfromtxt('health_data.csv', delimiter=',', skip_header=1, usecols=(5, 6, 7))

# usecols takes 0-based indices: 5, 6 and 7 are assumed to be the height, weight, and wtdesire columns

# Extract height, weight, and wtdesire into separate arrays
heights = data[:, 0]
weights = data[:, 1]
wtdesired = data[:, 2]

# Print the first few elements of each array to verify
print("Heights:", heights[:5])
print("Weights:", weights[:5])
print("Weight Desired:", wtdesired[:5])
```

In this example:

- `np.genfromtxt()` reads data from the CSV file `health_data.csv`.
- `delimiter=','` specifies that columns are separated by commas.
- `skip_header=1` indicates that the first row contains headers and should be skipped.
- `usecols=(5, 6, 7)` specifies that only columns 5, 6 and 7 (0-indexed) should be read, which correspond to `height`, `weight`, and `wtdesire`.
- The resulting `data` array is two-dimensional, with one row per record and three columns holding `height`, `weight`, and `wtdesire`.
- We then use slicing to extract each column into separate arrays: `heights`, `weights`, and `wtdesired`.

You can adjust the `usecols` parameter based on the actual column indices in your CSV file. Additionally, you might want to handle missing or malformed data using the `missing_values` and `filling_values` parameters if your data contains any such instances.
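
As a rough illustration of the missing-data handling mentioned above, the sketch below reads a small in-memory CSV instead of a file. The text in `csv_text` and its values are hypothetical; only `delimiter`, `skip_header`, and `filling_values` are actual `genfromtxt` parameters.

```python
import io
import numpy as np

# Hypothetical CSV text standing in for a file on disk; the second data row has no weight
csv_text = "height,weight,wtdesire\n70,175,160\n64,,115\n60,105,105\n"

# Default behaviour: an empty field in a float column is read as nan
with_nan = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

# filling_values substitutes the given value for missing fields instead
filled = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1,
                       filling_values=-1)

print(with_nan[1])
print(filled[1])
```

Leaving the gaps as `nan` is usually the safer choice for statistics, because a sentinel such as `-1` would silently distort means and ranges.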

## Slicing & Dicing
In NumPy, a bare colon `:` in one dimension of an index selects every element along that dimension. The indexing used in the code above works as follows:

```python
heights = data[:, 0]
```

In this line, `data[:, 0]` is indexing the `data` array. The `:` in the first position means "select all rows", and `0` in the second position means "select the element at index 0 in each row". So, `data[:, 0]` selects all rows from the `data` array and only the element at index 0 (the first column) in each row. This effectively extracts the `heights` column from the `data` array.

Similarly,

```python
weights = data[:, 1]
```

selects all rows from the `data` array and only the element at index 1 (the second column) in each row, effectively extracting the `weights` column.

And,

```python
wtdesired = data[:, 2]
```

selects all rows from the `data` array and only the element at index 2 (the third column) in each row, effectively extracting the `wtdesired` column.

So, in summary:

- `data[:, 0]` selects all rows and only the first column (index 0).
- `data[:, 1]` selects all rows and only the second column (index 1).
- `data[:, 2]` selects all rows and only the third column (index 2).

This way, you can use NumPy's slicing and indexing capabilities to extract specific columns or subsets of your data array for further analysis or processing.
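
The shapes involved can be checked on a small array. The sketch below uses a hypothetical two-row `table` rather than the real data, and also shows two details the text does not cover: slicing with `0:1` keeps the column two-dimensional, and a basic slice is a view that shares memory with the original array.

```python
import numpy as np

# Hypothetical 2 x 3 array: columns stand for height, weight, wtdesire
table = np.array([[70.0, 175.0, 160.0],
                  [64.0, 125.0, 115.0]])

first_col = table[:, 0]        # all rows, column 0 -> 1-D array of shape (2,)
first_col_2d = table[:, 0:1]   # all rows, columns 0..0 -> 2-D array of shape (2, 1)
first_row = table[0, :]        # row 0, all columns -> 1-D array of shape (3,)

print(first_col.shape, first_col_2d.shape, first_row.shape)

# Basic slicing returns a view, not a copy
print(np.shares_memory(first_col, table))  # True
```

If an independent array is needed, `table[:, 0].copy()` avoids accidental changes to the original data.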

# Descriptive statistics with NumPy
```python
# import necessary libraries
import numpy as np
import scipy.stats
```
1. Read the contents of file cdc_nan.csv, containing heights, weights and ages, into array data.
Separate the heights (column 0) and the weights (column 1).



In [44]:
import numpy as np
import scipy.stats
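
The cell above only imports the libraries; the read step itself is not shown. A minimal sketch of it follows, assuming `cdc_nan.csv` is comma-separated with one header row and with heights and weights in columns 0 and 1. The helper `load_heights_weights` and the in-memory demo text are hypothetical and exist only to keep the example runnable without the file.

```python
import io
import numpy as np

def load_heights_weights(source, skip_header=1):
    """Read a comma-separated source and return (data, heights, weights)."""
    data = np.genfromtxt(source, delimiter=',', skip_header=skip_header)
    heights = data[:, 0]   # column 0
    weights = data[:, 1]   # column 1
    return data, heights, weights

# In the exercise this would be: load_heights_weights('cdc_nan.csv')
# A small in-memory stand-in with one missing weight is used here instead.
demo = io.StringIO("height,weight,age\n70,175,77\n64,,33\n60,105,49\n")
data, heights, weights = load_heights_weights(demo)

print(heights)
print(weights)
```

If the real file has no header row, `skip_header=0` would be the appropriate setting.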

Let's go through each step:

1. Calculate the median for the heights and the weights and assign the values to variables.
2. Check if the arrays contain missing values.
3. Find the positions of missing values (NaNs) in the weights array.
4. Calculate the median for the weights, ignoring the NaN values.

Here's how you can do it:

```python
import numpy as np

# Assuming heights and weights arrays are already populated

# 1. Calculate the median for the heights and weights and assign the values to variables
height_median = np.median(heights)
weight_median = np.median(weights)

# 2. Check if the arrays contain missing values
heights_has_nan = np.isnan(heights).any()
weights_has_nan = np.isnan(weights).any()

# 3. Find the positions of missing values (NaNs) in the weights array
nan_indices = np.where(np.isnan(weights))

# 4. Calculate the median for the weights, ignoring the NaN values
weights_without_nan = weights[~np.isnan(weights)]
weight_median_without_nan = np.median(weights_without_nan)

# Printing the results
print("Median height:", height_median)
print("Median weight:", weight_median)
print("Heights contain NaNs:", heights_has_nan)
print("Weights contain NaNs:", weights_has_nan)
print("Positions of NaNs in weights:", nan_indices)
print("Median weight without NaNs:", weight_median_without_nan)
```

This code will perform the requested tasks:

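- Calculate the median for both heights and weights. `np.median` returns `nan` for an array that contains any NaN, which is why the NaN-free median in the last step is needed.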
- Check if there are any missing values (NaNs) in the arrays.
- Find the positions of NaNs in the weights array.
- Calculate the median for weights, ignoring the NaN values.
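
As an alternative to masking by hand, NumPy also provides `np.nanmedian`, which skips NaNs directly. The sketch below compares the three approaches on a hypothetical `weights_demo` array; the values are made up.

```python
import numpy as np

# Hypothetical weights with one missing value
weights_demo = np.array([175.0, np.nan, 105.0, 150.0])

# A single NaN makes the plain median nan
plain_median = np.median(weights_demo)

# Masking, as in the code above
masked_median = np.median(weights_demo[~np.isnan(weights_demo)])

# Built-in NaN-aware variant
nan_median = np.nanmedian(weights_demo)

print(plain_median, masked_median, nan_median)  # nan 150.0 150.0
```

Both NaN-aware routes give the same answer; the masked version is handy when the cleaned array is reused for several statistics.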

## Let us go through each calculation step by step:

1. Mean: Calculate the mean values for heights and weights.
2. Range: Calculate the range of the heights and the weights.
3. Interquartile Range (IQR): Find the quartiles for the heights and the weights, and the Interquartile Range (IQR).
4. Variance: Calculate population and sample variance for heights and weights.
5. Standard deviation: Calculate population and sample standard deviation for heights and weights.
6. Covariance: Find the covariance between heights and weights.
7. Correlation: Find the correlation between heights and weights.

Here's how you can implement these calculations:

```python
import numpy as np

# Assuming heights and weights arrays are already populated

# 1. Mean
height_mean = np.mean(heights)
weight_mean = np.mean(weights)

# 2. Range
height_range = np.ptp(heights)
weight_range = np.ptp(weights)

# 3. Interquartile Range (IQR)
height_quartiles = np.percentile(heights, [25, 75])
height_iqr = height_quartiles[1] - height_quartiles[0]

weight_quartiles = np.percentile(weights, [25, 75])
weight_iqr = weight_quartiles[1] - weight_quartiles[0]

# 4. Variance
height_pop_variance = np.var(heights)
height_sample_variance = np.var(heights, ddof=1)  # Use ddof=1 for sample variance

weight_pop_variance = np.var(weights)
weight_sample_variance = np.var(weights, ddof=1)

# 5. Standard deviation
height_pop_std = np.std(heights)
height_sample_std = np.std(heights, ddof=1)  # Use ddof=1 for sample standard deviation

weight_pop_std = np.std(weights)
weight_sample_std = np.std(weights, ddof=1)

# 6. Covariance
covariance = np.cov(heights, weights, ddof=0)[0, 1]

# 7. Correlation
correlation = np.corrcoef(heights, weights)[0, 1]

# Print the results
print("Mean height:", height_mean)
print("Mean weight:", weight_mean)
print("Height range:", height_range)
print("Weight range:", weight_range)
print("Height IQR:", height_iqr)
print("Weight IQR:", weight_iqr)
print("Population variance of height:", height_pop_variance)
print("Sample variance of height:", height_sample_variance)
print("Population variance of weight:", weight_pop_variance)
print("Sample variance of weight:", weight_sample_variance)
print("Population standard deviation of height:", height_pop_std)
print("Sample standard deviation of height:", height_sample_std)
print("Population standard deviation of weight:", weight_pop_std)
print("Sample standard deviation of weight:", weight_sample_std)
print("Covariance between height and weight:", covariance)
print("Correlation between height and weight:", correlation)
```

This code will calculate all the requested statistics for heights and weights and print the results.
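
If the arrays still contain NaNs, as the weights from `cdc_nan.csv` do, every function used above returns `nan`. One possible way around this is sketched below with NumPy's NaN-aware functions; `heights_demo` and `weights_demo` are hypothetical arrays, and the pairwise masking for covariance and correlation is one choice among several.

```python
import numpy as np

# Hypothetical data with one missing weight
heights_demo = np.array([70.0, 64.0, 60.0, 66.0, 72.0])
weights_demo = np.array([175.0, np.nan, 105.0, 150.0, 190.0])

# NaN-aware counterparts of mean, range, percentiles, variance and standard deviation
weight_mean = np.nanmean(weights_demo)
weight_range = np.nanmax(weights_demo) - np.nanmin(weights_demo)
q1, q3 = np.nanpercentile(weights_demo, [25, 75])
weight_iqr = q3 - q1
weight_sample_var = np.nanvar(weights_demo, ddof=1)
weight_sample_std = np.nanstd(weights_demo, ddof=1)

# np.cov and np.corrcoef have no NaN-aware variant, so keep only complete pairs
complete = ~np.isnan(heights_demo) & ~np.isnan(weights_demo)
covariance = np.cov(heights_demo[complete], weights_demo[complete], ddof=0)[0, 1]
correlation = np.corrcoef(heights_demo[complete], weights_demo[complete])[0, 1]

print("Mean weight:", weight_mean)
print("Weight range:", weight_range)
print("Weight IQR:", weight_iqr)
print("Covariance:", covariance)
print("Correlation:", correlation)
```

Dropping incomplete pairs keeps heights and weights aligned row by row, which matters for covariance and correlation.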

Let's explain each of the key statistical terms used in the context of analyzing heights and weights data:

1. **Mean**:
   - The mean, also known as the average, is a measure of central tendency. It is calculated by summing up all the values in a dataset and dividing by the number of values.
   - In this context, the mean height and mean weight represent the average height and weight of the individuals in the dataset.

2. **Range**:
   - The range is a measure of the spread of a dataset. It is calculated as the difference between the maximum and minimum values in the dataset.
   - In this context, the height range is the difference between the tallest and shortest individuals, and the weight range is the difference between the heaviest and lightest.

3. **Interquartile Range (IQR)**:
   - The interquartile range (IQR) is a measure of statistical dispersion. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1).
   - In this context, the height IQR and weight IQR represent the range containing the middle 50% of the heights and weights, respectively.

4. **Variance**:
   - Variance measures the variability or spread of a dataset. It quantifies how far the data points are from the mean.
   - Population variance is calculated by averaging the squared differences between each data point and the population mean.
   - Sample variance is similar, but it uses \(n - 1\) in the denominator (where \(n\) is the number of data points) to provide an unbiased estimate of the population variance.
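   - The standard deviation is the square root of the variance and is expressed in the same units as the data. In NumPy, `ddof=0` (the default) gives the population value and `ddof=1` gives the sample value for both `np.var` and `np.std`.
   - In this context, the population and sample variances of height and weight quantify the variability in the heights and weights of individuals, respectively.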

5. **Bivariate Covariance**:
   - Covariance measures the extent to which two variables change together. It indicates the direction of the linear relationship between two variables.
   - Bivariate covariance between two variables \(X\) and \(Y\) is calculated as the average of the product of the differences of each variable from their respective means.
   - In this context, the covariance between height and weight indicates how height and weight tend to vary together. A positive covariance suggests that as height increases, weight tends to increase, and vice versa.

6. **Correlation**:
   - Correlation measures the strength and direction of the linear relationship between two variables.
   - The correlation coefficient ranges from -1 to +1, where:
     - \(+1\) indicates a perfect positive linear relationship,
     - \(-1\) indicates a perfect negative linear relationship, and
     - \(0\) indicates no linear relationship.
   - In this context, the correlation between height and weight quantifies how closely related these two variables are. A positive correlation indicates that as height increases, weight tends to increase, and vice versa. The strength of the correlation can be interpreted based on its magnitude. A correlation close to \(+1\) or \(-1\) indicates a strong linear relationship, while a correlation close to \(0\) indicates a weak linear relationship.
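
The definitions above can be checked against NumPy by computing them by hand. This is a sketch on two short hypothetical arrays, `h` and `w`, with made-up values; it is not taken from the exercise data.

```python
import numpy as np

# Hypothetical heights and weights
h = np.array([70.0, 64.0, 60.0, 66.0, 72.0])
w = np.array([175.0, 125.0, 105.0, 150.0, 190.0])
n = h.size

# Variance: mean squared deviation (population) or the same sum divided by n - 1 (sample)
squared_dev = (h - h.mean()) ** 2
pop_var = squared_dev.sum() / n
sample_var = squared_dev.sum() / (n - 1)

# Covariance: average product of the deviations from each mean (population form)
pop_cov = ((h - h.mean()) * (w - w.mean())).sum() / n

# Correlation: covariance divided by the product of the standard deviations
corr = pop_cov / (np.std(h) * np.std(w))

print(np.isclose(pop_var, np.var(h)))                     # True
print(np.isclose(sample_var, np.var(h, ddof=1)))          # True
print(np.isclose(pop_cov, np.cov(h, w, ddof=0)[0, 1]))    # True
print(np.isclose(corr, np.corrcoef(h, w)[0, 1]))          # True
```

The correlation is the same whether population or sample formulas are used throughout, because the `n` or `n - 1` factors cancel.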

# Mode

In statistics, the mode refers to the value or values that occur most frequently in a dataset. It is a measure of central tendency, just like the mean and median.

The mode can be particularly useful when dealing with categorical or discrete data, such as the outcome of a survey where respondents select options from a list, or in scenarios where the data is not normally distributed and there might be multiple peaks or clusters.

For example, in the dataset `[1, 4, 4, 5, 5, 8, 2, 7, 7, 7]`, the mode is `7` because it appears more frequently than any other value.

Now, regarding the provided code:

```python
xx = [1, 4, 4, 5, 5, 8, 2, 7, 7, 7]
xxx = [1, 4, 4, 5, 5, 8, 2, 7, 7]
yy = np.array(xx)
yyy = np.array(xxx)
```

The code defines two Python lists, `xx` and `xxx`, which differ only in that `xxx` omits the final `7`. As a result, `xx` has a single mode (`7`, with three occurrences), while in `xxx` the values `4`, `5`, and `7` each occur twice, so it has three modes. Both lists are then converted into NumPy arrays, `yy` and `yyy`, respectively.

The code is fine in terms of syntax, but it does not itself calculate a mode. NumPy has no dedicated mode function; one option is SciPy's `scipy.stats.mode()`, which returns the most frequent value together with its count. When several values tie, it returns only one of them.

For example:

```python
import numpy as np
from scipy.stats import mode

data = [1, 4, 4, 5, 5, 8, 2, 7, 7, 7]
mode_result = mode(data)

print("Mode(s):", mode_result.mode)
print("Count(s):", mode_result.count)
```

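Depending on the SciPy version, this code would output the following (from SciPy 1.11 onward the results are scalars by default, so they print without brackets):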

```
Mode(s): [7]
Count(s): [3]
```

This indicates that the mode of the dataset is `7`, and that it appears `3` times.
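
For data with several equally frequent values, such as `yyy` above, a single reported mode hides the ties. One way to list every mode with NumPy alone is sketched below using `np.unique` with `return_counts=True`; the name `all_modes` is illustrative.

```python
import numpy as np

# Same values as `xxx` / `yyy` above
yyy = np.array([1, 4, 4, 5, 5, 8, 2, 7, 7])

# Sorted unique values and how often each one occurs
values, counts = np.unique(yyy, return_counts=True)

# Keep every value whose count equals the highest count
all_modes = values[counts == counts.max()]

print("Mode(s):", all_modes)      # Mode(s): [4 5 7]
print("Count:", counts.max())     # Count: 2
```

Applied to `yy`, the same code returns only `7`, since no other value occurs three times.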