# Homework 1 Assignment

## Task

1. Choose a topic for further work with the database. Using one of the data resources, download a data set that interests you as one or several CSV files.
2. Using Python, perform the following operations on the selected data (see the "Instructions" file). Import the library via `import pandas as pd` and then:
    - create a DataFrame with 4 columns and 4 rows using the `pd.DataFrame` command;
    - load a CSV file with real data using the `pd.read_csv` command;
    - apply the commands `info()`, `describe()`, `columns`, and `head()`;
    - select the required columns based on the specifics of the data;
    - select the required rows based on the specifics of the data.

## Solution

### Step 0: Warmup

First, let us create a small arbitrary DataFrame, as requested by the task.

In [16]:
import pandas as pd

# Creating a table with columns "name", "job", "salary", and "gpa"
# We make four rows, as requested
test_df = pd.DataFrame(data={
    'name': ['Denis', 'Oleksandr', 'David', 'Misha'],
    'job': ['Cryptographer', 'ML Engineer', 'Student', 'PhD Student'],
    'salary': [50000, 60000, 0, 1],
    'gpa': [4.5, 4.6, 4.9, 3.0]
})

# Printing the resultant dataframe
test_df

Unnamed: 0,name,job,salary,gpa
0,Denis,Cryptographer,50000,4.5
1,Oleksandr,ML Engineer,60000,4.6
2,David,Student,0,4.9
3,Misha,PhD Student,1,3.0
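
The same warm-up table can also be built row by row. The following is a minimal sketch, assuming a list of dictionaries (one per row) is more convenient than a dictionary of columns; the names `records` and `test_df_from_rows` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical alternative: one dictionary per row instead of one list per column
records = [
    {'name': 'Denis', 'job': 'Cryptographer', 'salary': 50000, 'gpa': 4.5},
    {'name': 'Oleksandr', 'job': 'ML Engineer', 'salary': 60000, 'gpa': 4.6},
    {'name': 'David', 'job': 'Student', 'salary': 0, 'gpa': 4.9},
    {'name': 'Misha', 'job': 'PhD Student', 'salary': 1, 'gpa': 3.0},
]
test_df_from_rows = pd.DataFrame(records)

# The task asks for four rows and four columns; shape reports (rows, columns)
print(test_df_from_rows.shape)
```

Both constructions should give the same table; which one reads better depends on whether the data arrives column-wise or record-wise.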


### Step 1
We use the ["Students Performance in Exams"](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams/code) dataset from Kaggle. The main CSV file we refer to is stored locally and can be found in the GitHub repository attached to the assignment.

That being said, let's get started:

In [4]:
import pandas as pd

# Specifying path to the file containing our dataset
DATASET_PATH = 'StudentsPerformance.csv'

# Loading the dataframe
df = pd.read_csv(DATASET_PATH, delimiter=',')
df # Showing the dataframe

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77
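
When only part of a file is needed, `pd.read_csv` can restrict what it loads. The sketch below is an illustration under stated assumptions: the five rows are copied from the output above and inlined through `io.StringIO` so that the example runs without the file, and the names `SAMPLE_CSV` and `scores` are hypothetical.

```python
import io
import pandas as pd

# First five rows of the dataset, inlined so the sketch does not need the CSV file
SAMPLE_CSV = """gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
female,group B,bachelor's degree,standard,none,72,72,74
female,group C,some college,standard,completed,69,90,88
female,group B,master's degree,standard,none,90,95,93
male,group A,associate's degree,free/reduced,none,47,57,44
male,group C,some college,standard,none,76,78,75
"""

# usecols keeps only the listed columns while the file is being parsed
scores = pd.read_csv(
    io.StringIO(SAMPLE_CSV),
    usecols=['math score', 'reading score', 'writing score'],
)
print(scores.shape)
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by the path stored in `DATASET_PATH`.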


### Step 2
Now we call the methods and attributes requested in the task:
- `df.info()`
- `df.describe()`
- `df.columns`
- `df.head()`

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
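
The `+` in `memory usage: 62.6+ KB` marks a lower bound: by default pandas does not measure the Python strings stored in `object` columns. The sketch below, which uses a hypothetical three-row frame named `demo`, compares the default estimate with a deep measurement.

```python
import pandas as pd

# Hypothetical miniature frame with one text column and one integer column
demo = pd.DataFrame({
    'gender': ['female', 'female', 'male'],
    'math score': [72, 69, 47],
})

# Default estimate: text columns are not inspected value by value
shallow = demo.memory_usage().sum()

# deep=True also accounts for the memory held by the stored strings
deep = demo.memory_usage(deep=True).sum()
print(shallow, deep)

# The same deep measurement can be requested from info() directly
demo.info(memory_usage='deep')
```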


In [6]:
df.describe()

Unnamed: 0,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0
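
By default, `describe()` summarizes only numeric columns, which is why the five text columns are absent from the table above. The following sketch shows one way to summarize the text columns as well; it assumes a hypothetical five-row frame named `sample`, with values copied from the first rows of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two text columns and one numeric column
sample = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male'],
    'lunch': ['standard', 'standard', 'standard', 'free/reduced', 'standard'],
    'math score': [72, 69, 90, 47, 76],
})

# Default behaviour: only the numeric column is summarized
numeric_summary = sample.describe()

# Excluding numeric dtypes yields count / unique / top / freq for the text columns
text_summary = sample.describe(exclude=[np.number])
print(text_summary)
```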


In [7]:
# Printing columns present in the dataset
list(df.columns)

['gender',
 'race/ethnicity',
 'parental level of education',
 'lunch',
 'test preparation course',
 'math score',
 'reading score',
 'writing score']

In [8]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


### Step 3
Now, let us select only the last four columns.

In [9]:
columns_of_interest = df.columns[4:]  # Columns: [test preparation course, math score, reading score, writing score]
df[columns_of_interest].head(12)  # Showing the first 12 rows

Unnamed: 0,test preparation course,math score,reading score,writing score
0,none,72,72,74
1,completed,69,90,88
2,none,90,95,93
3,none,47,57,44
4,none,76,78,75
5,none,71,83,78
6,completed,88,95,92
7,none,40,43,39
8,completed,64,64,67
9,none,38,60,50
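
Slicing `df.columns` by position is one option among several. The sketch below shows two alternatives, assuming a hypothetical three-row frame named `sample` with values taken from the first rows of the dataset: a negative positional slice, which does not depend on how many columns precede the ones of interest, and a selection by data type.

```python
import pandas as pd

# Hypothetical sample with one extra leading column
sample = pd.DataFrame({
    'gender': ['female', 'female', 'female'],
    'test preparation course': ['none', 'completed', 'none'],
    'math score': [72, 69, 90],
    'reading score': [72, 90, 95],
    'writing score': [74, 88, 93],
})

# Position-based: the last four columns, counted from the end
last_four = sample.iloc[:, -4:]

# Type-based: only the numeric score columns
scores_only = sample.select_dtypes(include='number')

print(list(last_four.columns))
print(list(scores_only.columns))
```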


### Step 4
Now let us select only the rows where the math score is at least 95 and the reading score is at least 90.

Additionally, we display the number of such rows.

In [10]:
# Defining thresholds
MATH_SCORE_THRESHOLD = 95
READING_SCORE_THRESHOLD = 90

# Selecting those rows where the math score and the reading score are at or above the thresholds
genius_students = df[(df['math score'] >= MATH_SCORE_THRESHOLD) & (df['reading score'] >= READING_SCORE_THRESHOLD)]
print(f'There are {len(genius_students)} reported students with amazing math and reading scores')

# Let us show the first 10 of them
genius_students.head(10)

There are 17 reported students with amazing math and reading scores


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
114,female,group E,bachelor's degree,standard,completed,99,100,100
149,male,group E,associate's degree,free/reduced,completed,100,100,93
165,female,group C,bachelor's degree,standard,completed,96,100,100
179,female,group D,some high school,standard,completed,97,100,100
263,female,group E,high school,standard,none,99,93,90
451,female,group E,some college,standard,none,100,92,97
458,female,group E,bachelor's degree,standard,none,100,100,100
539,male,group A,associate's degree,standard,completed,97,92,86
562,male,group C,bachelor's degree,standard,completed,96,90,92
623,male,group A,some college,standard,completed,100,96,86
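
Row and column selection can also be combined in a single step with `.loc`. The sketch below is an illustration rather than the notebook's own method: it assumes a hypothetical four-row frame named `sample`, with values and index labels borrowed from rows shown earlier, and reuses the same thresholds.

```python
import pandas as pd

MATH_SCORE_THRESHOLD = 95
READING_SCORE_THRESHOLD = 90

# Hypothetical sample; the index labels mimic the original row numbers
sample = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male'],
    'math score': [72, 90, 99, 96],
    'reading score': [72, 95, 100, 90],
    'writing score': [74, 93, 100, 92],
}, index=[0, 2, 114, 562])

# Boolean mask: True where both scores reach their thresholds
mask = (
    (sample['math score'] >= MATH_SCORE_THRESHOLD)
    & (sample['reading score'] >= READING_SCORE_THRESHOLD)
)

# Summing a boolean mask counts the matching rows
print(int(mask.sum()))

# .loc takes the row mask and the column list together
top_scores = sample.loc[mask, ['math score', 'reading score']]
print(top_scores)
```

Keeping the mask in its own variable makes it possible to count the matches and filter the rows without repeating the condition.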
