
In [13]:
# Download link for the CSV file on Google Drive
url <- "https://drive.google.com/uc?export=download&id=1dah4LkSZoas-XedmpyJIWBb5R60VkqvM"

# Read the file from the URL
df <- read.csv(url)

# Inspect the first few rows of the data
head(df)




|   | State      | Population | Murder.Rate | Abbreviation |
|---|------------|------------|-------------|--------------|
|   | `<chr>`    | `<int>`    | `<dbl>`     | `<chr>`      |
| 1 | Alabama    | 4779736    | 5.7         | AL           |
| 2 | Alaska     | 710231     | 5.6         | AK           |
| 3 | Arizona    | 6392017    | 4.7         | AZ           |
| 4 | Arkansas   | 2915918    | 5.6         | AR           |
| 5 | California | 37253956   | 4.4         | CA           |
| 6 | Colorado   | 5029196    | 2.8         | CO           |


In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Ordered Data Elements  

In data analysis, **ordered data** refers to elements that can be arranged in a meaningful sequence. Data is often categorized into two main types: **numerical** and **categorical**.

##### Types of Data:
- **Numerical Data** (quantitative):  
  - Represents measurable quantities.  
  - Subtypes:  
    - **Continuous:** Can take any value within a range (e.g., height, weight).  
    - **Discrete:** Can take only specific, countable values, typically integers (e.g., number of students).

- **Categorical Data** (qualitative):  
  - Represents groupings or labels.  
  - Subtypes:  
    - **Binary:** Two categories (e.g., Yes/No).  
    - **Ordinal:** Categories with a meaningful order (e.g., Low/Medium/High).

##### Data Types Table:
| **Data Type**      | **Subtype**  | **Examples**                 |
|---------------------|--------------|------------------------------|
| **Numerical**       | Continuous   | Height, Weight, Temperature |
|                     | Discrete     | Number of items, Age in years |
| **Categorical**     | Binary       | Yes/No, On/Off              |
|                     | Ordinal      | Education Level, Rankings   |

##### Numerical Data:
- **Continuous Data:**  
  Values can take any real number within a range $[a, b]$:  
  $$ x \in [a, b] \subset \mathbb{R} $$  
  Example: height, weight.  

- **Discrete Data:**  
  Values are specific and countable:  
  $$ x \in \mathbb{N} $$  
  Example: number of students, age in years.  

##### Categorical Data:
- **Binary Data:**  
  Two distinct categories:  
  $$ x \in \{0, 1\} $$  
  Example: Yes/No, On/Off.  

- **Ordinal Data:**  
  Ordered categories where the sequence matters:  
  $$ x_1 < x_2 < x_3 $$  
  Example: Low < Medium < High.
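
The distinctions above can be sketched in Python. This is a minimal illustration, not a general-purpose type detector: the `Level` enum and the `describe` helper are hypothetical names introduced here, and the mapping from Python types to data types is an assumed heuristic (for example, it treats every `float` as continuous).

```python
from enum import IntEnum


class Level(IntEnum):
    """Ordinal categories: the integer values only encode the order."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def describe(value):
    """Return a rough data-type label for one value (illustrative heuristic)."""
    if isinstance(value, Level):   # check first: IntEnum members are also ints
        return "categorical/ordinal"
    if isinstance(value, bool):    # check before int: bool is a subclass of int
        return "categorical/binary"
    if isinstance(value, int):
        return "numerical/discrete"
    if isinstance(value, float):
        return "numerical/continuous"
    return "categorical"


# Ordinal values can be compared and sorted; plain labels cannot meaningfully be.
ratings = [Level.HIGH, Level.LOW, Level.MEDIUM]
sorted_ratings = sorted(ratings)
print([r.name for r in sorted_ratings])

for v in (172.5, 30, True, Level.MEDIUM, "AL"):
    print(repr(v), "->", describe(v))
```

Using `IntEnum` keeps the category names readable while still allowing comparisons such as `Level.LOW < Level.HIGH`.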


### Tabular Data  


In data analysis, **tabular data** refers to data organized in a table format, where rows and columns represent structured information. This structure is commonly used in tools like spreadsheets, databases, and programming frameworks (e.g., pandas in Python, data frames in R).

##### Key Concepts:
1. **Data Frame:**  
   A two-dimensional table structure where:  
   - **Rows** represent individual records or observations.  
   - **Columns** represent variables or features.  

2. **Feature (Attribute):**  
   A column in the table representing a specific variable or characteristic of the data.  
   Example: Age, Height, Income.

3. **Record (Observation):**  
   A row in the table representing a single instance of data.  
   Example: Data about one person or event.

4. **Outcome (Target):**  
   A specific feature that represents the result or value to predict or analyze.  
   Example: Whether a customer makes a purchase (Yes/No).

##### Example of a Data Frame:
| **Record ID** | **Name**   | **Age** | **Height (cm)** | **Purchase** |
|---------------|------------|---------|-----------------|--------------|
| 1             | Alice      | 25      | 165             | Yes          |
| 2             | Bob        | 30      | 175             | No           |
| 3             | Charlie    | 28      | 180             | Yes          |

- **Features:** Name, Age, Height, Purchase (Record ID only identifies the row).  
- **Records:** Row 1 (Alice), Row 2 (Bob), Row 3 (Charlie).  
- **Outcome:** The Purchase column, which is the feature to predict or analyze.
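
The same example can be sketched in plain Python as a list of records, one dictionary per row. The key names (`record_id`, `height_cm`, and so on) and the choice to encode the outcome as a boolean are assumptions made for this illustration; a library such as pandas would normally handle this.

```python
# Each dictionary is one record (row); each key is one column.
records = [
    {"record_id": 1, "name": "Alice", "age": 25, "height_cm": 165, "purchase": "Yes"},
    {"record_id": 2, "name": "Bob", "age": 30, "height_cm": 175, "purchase": "No"},
    {"record_id": 3, "name": "Charlie", "age": 28, "height_cm": 180, "purchase": "Yes"},
]

feature_names = ["name", "age", "height_cm"]  # input columns
outcome_name = "purchase"                     # target column

# A feature is one column read across all records.
ages = [r["age"] for r in records]

# A record is one row read across all columns.
first_record = records[0]

# Split every record into its feature values and its outcome.
X = [[r[f] for f in feature_names] for r in records]
y = [r[outcome_name] == "Yes" for r in records]

print(ages)
print(first_record)
print(X)
print(y)
```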

##### Mathematical Representation:
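- Data frame:  
  $$ D = \{ (x_i, y_i) \}_{i=1}^n $$  
  Where $ x_i = (x_{i1}, x_{i2}, \dots, x_{ip}) $ is the feature vector of record $ i $, $ y_i $ is its outcome, and $ n $ is the number of records (rows).

- Features:  
  $$ X = [X_1, X_2, \dots, X_p] $$  
  Where $ X_j = (x_{1j}, x_{2j}, \dots, x_{nj}) $ is the column of values for feature $ j $, and $ p $ is the number of features (columns).

- Records:  
  $$ R_i = (x_{i1}, x_{i2}, \dots, x_{ip}, y_i) $$  
  Where $ R_i $ represents a single row of data.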
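
As a worked instance, the example table above can be written in this notation. The sketch assumes that only the numeric columns Age and Height are treated as features (so $ p = 2 $) and that Purchase is the outcome; that selection is made here for illustration and is not dictated by the table.

$$ n = 3, \quad p = 2 $$

$$ R_1 = (25, 165, \text{Yes}), \quad R_2 = (30, 175, \text{No}), \quad R_3 = (28, 180, \text{Yes}) $$

$$ X = \begin{bmatrix} 25 & 165 \\ 30 & 175 \\ 28 & 180 \end{bmatrix}, \qquad y = (\text{Yes}, \text{No}, \text{Yes}) $$

Each row of $ X $ holds the feature values of one record, and each column holds one feature across all records.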

##### Summary:
- **Data Frame:** Organizes data in rows and columns.  
- **Feature:** Describes a specific property or variable.  
- **Record:** Represents a single observation.  
- **Outcome:** The target variable for prediction or analysis.



### Measures of Central Tendency  

In statistics, **measures of central tendency** describe the center or typical value of a dataset by summarizing a set of data points into a single representative value. Key measures include the **mean**, the **median**, and robust variants such as the trimmed mean and the weighted median.

##### Key Measures:

1. **Mean (Average):**  
   The mean is the sum of all values in a dataset divided by the number of values. It is the most common measure of central tendency.  
   $$ \text{Mean} = \frac{1}{n} \sum_{i=1}^{n} x_i $$  
   Where $x_i$ are the values, and $n$ is the number of data points.

2. **Weighted Mean:**  
   The weighted mean accounts for the importance (weight) of each data point. It is calculated by multiplying each value by its corresponding weight, summing them, and then dividing by the total weight.  
   $$ \text{Weighted Mean} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} $$  
   Where $w_i$ represents the weights.

3. **Trimmed Mean:**  
   The trimmed mean is calculated by sorting the values, removing a specified number (or percentage) of the smallest and largest values, and computing the mean of what remains. This reduces the influence of extreme values (outliers).  
   $$ \text{Trimmed Mean} = \frac{\sum_{i=k+1}^{n-k} x_{(i)}}{n-2k} $$  
   Where $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$ are the values sorted in ascending order, and $k$ is the number of values removed from each end.

4. **Median:**  
   The median is the middle value of a dataset once it has been sorted. If the dataset has an odd number of elements, the median is the middle one. If even, it is the average of the two middle values. Below, $x_{(i)}$ denotes the $i$-th smallest value.  
   - For an odd dataset:  
     $$ \text{Median} = x_{\left(\frac{n+1}{2}\right)} $$  
   - For an even dataset:  
     $$ \text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} $$

5. **Weighted Median:**  
   The weighted median is the median of a dataset in which each data point carries a weight. After sorting the values, it is the value at which the cumulative weight first reaches half of the total weight, so that the weight on one side is as close as possible to the weight on the other side.  
   - There is no simple closed-form formula for the weighted median, but it can be found through sorting and cumulative weight calculations.  
   
   Formally, with the values sorted so that $ x_{(1)} \leq \dots \leq x_{(n)} $, the weighted median is $ x_{(k)} $ for the smallest $ k $ that satisfies:
   $$ \sum_{i=1}^{k} w_{(i)} \geq \frac{1}{2} \sum_{i=1}^{n} w_{(i)} $$  
   Where:
   - $ w_{(i)} $ is the weight of the sorted value $ x_{(i)} $,
   - $ \sum_{i=1}^{k} w_{(i)} $ is the cumulative sum of weights up to the value $ x_{(k)} $,
   - $ n $ is the total number of data points.

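6. **Percentile:**  
   The $k$-th percentile is a value at or below which roughly $k$ percent of the data falls. For example, the 50th percentile is the median, the 25th percentile is the lower quartile, and the 75th percentile is the upper quartile.  
   $$ P_k = x_{\left(\frac{k(n+1)}{100}\right)} $$  
   Where $P_k$ is the $k$-th percentile and $x_{(j)}$ is the $j$-th smallest value. This is one of several conventions in use; when the position $\frac{k(n+1)}{100}$ is not a whole number, the percentile is interpolated between the two neighboring sorted values.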

7. **Robustness (Resistance):**  
   Robustness refers to how little a measure changes in the presence of extreme values or outliers. For example, the median is more robust than the mean: a single very large value can pull the mean far from the bulk of the data, while the median barely moves.

8. **Outlier:**  
   An outlier is an observation that lies far outside the range of most other data points. It can significantly affect the mean but has a minimal effect on the median.  
   A common rule of thumb flags as an outlier any data point that lies more than 1.5 times the interquartile range (IQR) above the 75th percentile or below the 25th percentile.  
   $$ \text{Outlier condition:} \, x_i < Q_1 - 1.5 \times IQR \text{ or } x_i > Q_3 + 1.5 \times IQR $$  
   Where $Q_1$ and $Q_3$ are the first and third quartiles, and $IQR$ is the interquartile range ($Q_3 - Q_1$).
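
The measures above can be tried on the six rows printed by `head(df)` at the top of the notebook. The following is a minimal sketch using only the Python standard library, not a definitive implementation: the helper names (`weighted_mean`, `trimmed_mean`, `weighted_median`, `iqr_outliers`) are introduced here for illustration, and the quartiles come from `statistics.quantiles` with `method="exclusive"`, which uses positions based on $n + 1$ with interpolation. Other tools may use a different percentile convention and give slightly different quartiles.

```python
from statistics import mean, median, quantiles

# The six rows shown by head(df): state populations and murder rates.
population = [4779736, 710231, 6392017, 2915918, 37253956, 5029196]
murder_rate = [5.7, 5.6, 4.7, 5.6, 4.4, 2.8]


def weighted_mean(values, weights):
    """Sum of weight * value divided by the total weight."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean of the sorted values after dropping the k smallest and k largest."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= 2k < n")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total."""
    half = sum(weights) / 2
    cumulative = 0
    for x, w in sorted(zip(values, weights)):
        cumulative += w
        if cumulative >= half:
            return x
    raise ValueError("values must not be empty")


def iqr_outliers(values):
    """Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


print("mean:", mean(murder_rate))
print("weighted mean:", weighted_mean(murder_rate, population))
print("trimmed mean (k=1):", trimmed_mean(murder_rate, 1))
print("median:", median(murder_rate))
print("weighted median:", weighted_median(murder_rate, population))
print("population outliers:", iqr_outliers(population))
```

On these six rows the population-weighted mean (about 4.48) is lower than the plain mean (about 4.8), and the weighted median is 4.4, because California's large population dominates the weights. California's population is also the only value flagged by the IQR rule. With so few rows this is only an illustration of the mechanics.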
