# Python and Data Science

Python is an open-source, interpreted, high-level language with good support for object-oriented programming. It is one of the languages most widely used by data scientists for data science projects and applications. Python provides extensive functionality for mathematics, statistics, and scientific computing, along with a rich set of libraries for data science applications.

One of the main reasons Python is widely used in the scientific and research communities is its ease of use and simple syntax, which make it easy to pick up for people who do not have an engineering background. It is also well suited to quick prototyping.

![](https://www.brsoftech.com/blog/wp-content/uploads/2019/11/most-in-demand-programming-languages-2020.png)

# Is Python a New Language?

Python was first released in 1991. It was created by Guido van Rossum as a hobby project. 

It was named after a comedy TV series.

![Monty Python](https://upload.wikimedia.org/wikipedia/en/c/cd/Monty_Python%27s_Flying_Circus_Title_Card.png)

# Computing for Everybody
As Python was becoming popular, Van Rossum submitted a funding proposal to DARPA called "Computer Programming for Everybody", in which he further defined his goals for Python:
- An easy and intuitive language just as powerful as major competitors
- Open source, so anyone can contribute to its development
- Code that is as understandable as plain English
- Suitability for everyday tasks, allowing for short development times

> In 2018, Python was the third most popular language on GitHub, a social coding website, behind JavaScript and Java.

According to a programming language popularity survey it is consistently among the top 10 most mentioned languages in job postings. Furthermore, Python has been among the 10 most popular programming languages every year since 2004 according to the TIOBE Programming Community Index.

# The Zen of Python

The Zen of Python is a collection of 19 "guiding principles" for writing computer programs that influence the design of the Python programming language. Software engineer Tim Peters wrote this set of principles and posted it on the Python mailing list in 1999. Peters's list left open a 20th principle "for Guido to fill in", referring to Guido van Rossum, the original author of the Python language. The vacancy for a 20th principle has not been filled.

- Beautiful is better than ugly.
- Explicit is better than implicit.
- Simple is better than complex.
- Complex is better than complicated.
- Flat is better than nested.
- Sparse is better than dense.
- Readability counts.
- Special cases aren't special enough to break the rules.
- Although practicality beats purity.
- Errors should never pass silently.
- Unless explicitly silenced.
- In the face of ambiguity, refuse the temptation to guess.
- There should be one—and preferably only one—obvious way to do it.
- Although that way may not be obvious at first unless you're Dutch.
- Now is better than never.
- Although never is often better than right now.
- If the implementation is hard to explain, it's a bad idea.
- If the implementation is easy to explain, it may be a good idea.
- Namespaces are one honking great idea—let's do more of those!

### Getting Started

The following line prints a message. To execute it, select the cell below and do one of the following:
1. Click the RUN button on the top
1. On the menubar on the top, click CELL > RUN CELLS
1. Press Ctrl + ENTER on the keyboard

In [None]:
print ("This is Python!")

Try to write a line yourself below. You can add an empty block of code using the following:
1. Click INSERT > INSERT CELL BELOW
1. Press ESC key on the keyboard to enter the command mode. The color on the left will change. Then press 'b' to add a cell 'below'

In a new code block, write a line that will print your name.

What does the next piece of code do?

In [None]:
a = 10
b = 15

if b > a:
    print("B is greater")

In [None]:
fruits = ["Apple", "Banana", "Mango"]

for fruit in fruits:
    print ("I eat "+fruit) 

# What is Anaconda?

Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.), that aims to simplify package management and deployment. 

There are several alternatives; however, Anaconda is a popular choice because it simplifies managing the Python components.

Jupyter Notebook (formerly IPython Notebooks) is a web-based interactive computational environment for creating Jupyter notebook documents. The "notebook" term can colloquially make reference to many different entities, mainly the Jupyter web application, Jupyter Python web server, or Jupyter document format.

# Markdown

You can edit this block to see how it is formatted. Double-click anywhere on this text to activate.

This is text. This is not python code. This will not run. 

But this will be displayed properly.

# This is a heading

## This is a smaller heading

### This is an even smaller heading



Here are the reasons why you should use markdown cells:
1. It makes your notes look better
1. It helps other programmers understand what you are doing

# Comments in Python

Comments are used to explain the code, make notes to help other programmers, or make notes for future scope.
They are mostly used to make code readable.

They are part of the code, not the markdown.

In [None]:
print ("Hello World!") # This is a string statement
print (5+9) # This is a number
print ('The end!') # Bye

In [None]:
def add_nums_and_print(num1, num2):
    """
    Add two numbers and print the sum.
    Parameters:
    num1: first number
    num2: second number
    Returns: None (the sum is printed, not returned)
    """
    ans = num1 + num2
    print("The answer is " + str(ans))

add_nums_and_print(5, 10)

In Python, you can use the help() function to see the documentation of a function.

In [None]:
help(print)

# Five Minute Python Refresher

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.

# Data Structures

### data structure
An organization of data for the purpose of making it easier to use.
### immutable data value
A data value which cannot be modified. Assignments to elements or slices (sub-parts) of immutable values cause a runtime error.
### mutable data value
A data value which can be modified. The types of all mutable values are compound types. Lists and dictionaries are mutable; strings and tuples are not.
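
The difference between mutable and immutable values is easiest to see by trying to change an element. The following is a minimal sketch; the variable names are chosen only for illustration.

```python
# Lists are mutable: an element can be replaced in place.
fruit_list = ["Apple", "Banana", "Mango"]
fruit_list[0] = "Orange"
print(fruit_list)

# Strings are immutable: item assignment raises a TypeError at run time.
greeting = "hello world!"
try:
    greeting[0] = "H"
    string_error = None
except TypeError as error:
    string_error = error
    print("Cannot modify a string:", error)

# Tuples are immutable as well.
pair = ("apple", 250)
try:
    pair[1] = 300
    tuple_error = None
except TypeError as error:
    tuple_error = error
    print("Cannot modify a tuple:", error)

# To "change" an immutable value, build a new value instead.
greeting = "H" + greeting[1:]
print(greeting)
```

The last step does not modify the original string; it creates a new string and rebinds the name `greeting` to it.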


## Following are the primary Data Structures that we will study:
**1. Strings**
Collection of Unicode characters. Indexed and immutable, e.g. `"hello world!"`

**2. Lists**
Collection of elements. Indexed and mutable; allows duplicates, e.g. `[10, 20, 30]`

**3. Tuple**
Collection which is indexed but immutable, e.g. `("apple", 250)`

**4. Set**
Collection of unordered elements that doesn't allow repetitions, e.g. `{"apple", "orange"}`

**5. Dictionary**
Collection of key-value pairs, e.g. `{key: value}`
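
As a quick illustration, the five structures can be created and inspected as follows. The values (for example the price of 180) are made up for this sketch.

```python
text = "hello world!"                    # string: indexed, immutable
numbers = [10, 20, 30, 20]               # list: indexed, mutable, duplicates allowed
item = ("apple", 250)                    # tuple: indexed, immutable
basket = {"apple", "orange", "apple"}    # set: unordered, duplicates are dropped
prices = {"apple": 250, "orange": 180}   # dictionary: key-value pairs

print(text[0])           # first character of the string
print(numbers[-1])       # last element of the list
print(item[1])           # second element of the tuple
print(len(basket))       # number of distinct elements in the set
print(prices["orange"])  # value stored under the key "orange"
```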

# Control Flow

A program’s control flow is the order in which the program’s code executes. The control flow of a Python program is regulated by conditional statements, loops, and function calls. This section covers the if statement and for and while loops; functions are covered in the next class.

![](https://www.researchgate.net/profile/Kay_Smarsly/publication/322509045/figure/fig1/AS:583153716215809@1516046088625/Control-flow-of-elementary-control-structures.png)

# The if Statement

Often, you need to execute some statements only if some condition holds, or choose statements to execute depending on several mutually exclusive conditions. The Python compound statement if, which uses if, elif, and else clauses, lets you conditionally execute blocks of statements. 

## Comparison Operators
- x == y
- x != y
- x < y
- x <= y
- x > y
- x >= y
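
The comparison operators are typically used as the conditions of `if`, `elif`, and `else` clauses. A minimal sketch, with a hypothetical helper function named `compare`:

```python
def compare(x, y):
    """Describe how x relates to y using comparison operators."""
    if x == y:
        return "x is equal to y"
    elif x < y:
        return "x is less than y"
    else:
        # Reached only when neither of the conditions above holds.
        return "x is greater than y"

print(compare(10, 15))
print(compare(15, 10))
print(compare(7, 7))
```

Only the first clause whose condition is true runs; the remaining clauses are skipped.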

## Python Indentation
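In Python, a code block is defined by indenting its lines by the same number of spaces. This is called Python indentation.

The block ends at the first line that is indented less than the block.

The best practice is to use four spaces per indentation level, as recommended by PEP 8. Most editors, including Jupyter, insert these spaces when you press Tab.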

##  Task 1.1: Fizz Buzz
Write code to print all the numbers up to a given number "N", replacing every multiple of 3 with the word "fizz", every multiple of 5 with the word "buzz", and every multiple of both 3 and 5 with "fizzbuzz".

For N=7, print the output:\
1 2 fizz 4 buzz fizz 7

All the items are printed on the same line.

In [None]:
max_num = 16

for fizzbuzz in range(1, max_num+1):
    if fizzbuzz % 3 == 0 and fizzbuzz % 5 == 0:
        print("fizzbuzz")
        continue
    elif fizzbuzz % 3 == 0:
        print("fizz")
        continue
    # complete the remaining section of this block

## Task 1.2: Create a Loan Interest Calculator Using Simple Interest

Create a function that takes the price, the down payment, the rate of interest, and the duration in years. Return the total amount the user has to pay, including interest and excluding the down payment.

In [None]:
def get_amount(price, downpayment, rate, time):
    principal = price - downpayment  # amount borrowed after the down payment
    interest = 0 # what should you write here?
    return principal + interest

get_amount(1000, 200, 4, 10)

# Working with Data

## Task 3: Find average value in a list

Make a list of 10 numbers; say, each number represents grade of a student in the class. You can make list as:

grades = [6,8,5,3,7,9,8,7,...]

Find the average grade of the students in the class without using the `+` operator.

In [None]:
grades = [6,8,5,3,7,9,8,7,6,7]
average_grades = # write your solution here

print ("Average grade is "+str(average_grades))

## Task 3: Finding values from a Dictionary
Make a dictionary with names and grades of 10 students in the class.

In [None]:
grades = {"Brandon":6, "Floris":8, "Levi":5, "Sanne":3, "Abdel":7, "Mila":9, "Daan":8, "Sophie":7, "Nora":6, "Mees":7 }

In [None]:
for name in grades:
    print (name, grades[name])

Find the grade of Floris.

In [None]:
 # write your solution here

What happens when you try to find the grade of a student who is not in the dictionary?

In [None]:
grades["Matt"]
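
Looking up a key that does not exist raises a `KeyError`. The sketch below shows three common ways to handle a missing key; it uses a small dictionary named `sample_grades` (a name chosen here so that the `grades` dictionary above stays untouched).

```python
sample_grades = {"Brandon": 6, "Floris": 8, "Levi": 5}

# Option 1: test membership before looking up the key.
if "Matt" in sample_grades:
    print(sample_grades["Matt"])
else:
    print("Matt is not in the dictionary")

# Option 2: .get() returns None (or a default you pass) instead of raising an error.
matt_grade = sample_grades.get("Matt")
floris_grade = sample_grades.get("Floris")
print(matt_grade, floris_grade)

# Option 3: catch the KeyError explicitly.
try:
    sample_grades["Matt"]
    missing_key = None
except KeyError as error:
    missing_key = error.args[0]
    print("No grade stored for", missing_key)
```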

### Task 3.1: Find name of the student with the highest score.
If more than one student has the highest score, you can print any of their names.

In [None]:
max_score = -1
max_student = None

for name in grades:
     # write your solution here
        
print (max_student, grades[max_student])

### Task 3.2: A list of grades of the students is below. Find the average and variance.

![](https://www.gstatic.com/education/formulas2/553212783/en/population_standard_deviation.svg)
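
In case the image does not load: assuming it shows the population standard deviation, as its file name suggests, the formulas can be written as follows, where $N$ is the number of grades, $x_i$ is the $i$-th grade, $\mu$ is the mean, $\sigma^2$ is the variance, and $\sigma$ is the standard deviation.

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}
$$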

In [None]:
grades = [6,8,5,3,7,9,8,7,6,7]

 # write your solution here

### Task 3.3: Find the Standard Deviation

In [None]:
import math

 # write your solution here

### Discussion: 

- What does average tell us about the students? 
- What does the standard deviation tell us?
- How can we use mean and standard deviation to analyze the data?

# What the #%@&

## Task 4: Fix the Errors
The following piece of code is not running. Can you fix it?

In [None]:
grades = [9,8,2,4,7,9,8,4,6,9]
mean = sum(grades)/len(grades)
deviations = [(x - mean) ** 2 for x in grades]
variance = sum(deviations) / len(grades)
std_dev = sqrt(var)
print ("Standard Deviation is " + std_dev)

## Example: Statistics of two different lists

You are given two lists containing scores of students from two different cities. Find their means, standard deviations.

Can you describe your observations?



In [None]:
import math

def find_mean(grades):
    return sum(grades)/len(grades)

def find_std(grades):
    mean = find_mean(grades)
    deviations = [(x - mean) ** 2 for x in grades]
    variance = sum(deviations) / len(grades)
    std_dev = math.sqrt(variance)
    return std_dev

def find_stats(grades):
    mean = find_mean(grades)
    std = find_std(grades)
    return (mean, std)
    
utrecht_grades = [9,8,2,2,7,9,8,4,8,9]
amsterdam_grades = [7,6,7,3,7,9,8,7,6,6]

print("Utrecht: \t"+ str(find_stats(utrecht_grades)))
print("Amsterdam: \t"+ str(find_stats(amsterdam_grades)))    

find_mean(utrecht_grades)

Combine both the lists into one list. Find the mean and standard deviation again. You can use the functions we've defined above. Discuss what changed.

In [None]:
all_grades = utrecht_grades + amsterdam_grades

print("All Students: \t"+ str(find_stats(all_grades)))    

### Visualizing the Distributions

In [None]:
import matplotlib.pyplot as plt
import numpy as np
plt.hist(utrecht_grades, bins=range(1, 11), alpha=0.5, label='Utrecht')
plt.show()


In [None]:
plt.hist(amsterdam_grades, bins=range(1, 11), alpha=0.5, label='Amsterdam')
plt.show()

# Introduction to Numeric Analysis Tools

NumPy, short for Numerical Python, addresses the limitations of traditional Python lists when it comes to numerical computations. Python lists are flexible but lack the optimized structure needed for handling large datasets and performing complex mathematical operations. NumPy bridges this gap by introducing a powerful array object that facilitates vectorized operations and enhances computational efficiency.

![https://i1.wp.com/indianaiproduction.com/wp-content/uploads/2019/06/NumPy-array.png?resize=768%2C368&ssl=1](https://i1.wp.com/indianaiproduction.com/wp-content/uploads/2019/06/NumPy-array.png?resize=768%2C368&ssl=1)
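
To make "vectorized operations" concrete, here is a small sketch that scales a handful of scores first with a plain list and then with a NumPy array. The numbers are arbitrary.

```python
import numpy as np

scores = [6, 8, 5, 3, 7]

# With a plain list, arithmetic on every element needs a loop or a comprehension.
scaled_list = [s * 10 for s in scores]

# With a NumPy array, the same operation applies to all elements at once.
scores_array = np.array(scores)
scaled_array = scores_array * 10

print(scaled_list)
print(scaled_array)

# Multiplying a list repeats the list instead of scaling its elements.
print(scores * 2)
```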

In [None]:
import numpy

In [None]:
grades_array = numpy.array(all_grades)
grades_array

In [None]:
type(grades_array)

In [None]:
numpy.max(grades_array)

## Two Dimensional Arrays

In [None]:
import numpy as np

In [None]:
ll = [[1,2,3],[4,5.0,6]]

arr = np.array(ll)
arr.shape
type(arr)
arr

In [None]:
ll = [[1,"Hi"],["hello",6]]
arr = np.array(ll)
arr

In [None]:
list(range(10,30,2))

In [None]:
arr = np.arange(10)
arr

### Special Functions for Creating Arrays

In [None]:
np.arange(0.5,10.4,0.8)


In [None]:
np.ones((3,6))

In [None]:
np.zeros((3,3))

In [None]:
np.eye(3)

In [None]:
np.linspace(3,4,num=9)

In [None]:
np.random.rand(10,2)

In [None]:
np.random.randint(10,20,(3,2))

### Indexing, Slicing

In [None]:
A = np.array(  [  [3.4,8.7,9.9],
                [1.1,-7.8,-0.7],
                [4.1,12.3,4.8]])
A[1]

In [None]:
A[1][2]

In [None]:
arr = np.arange(100)
arr[10:30]

### Statistics with NumPy

In [None]:
grades_array.mean()

In [None]:
grades_array.std()
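
By default, NumPy's `std()` divides by N (the population formula, `ddof=0`), which is the same formula as the hand-written `find_std` earlier. Passing `ddof=1` divides by N - 1 instead (the sample formula). A short sketch with a separate array named `grade_sample`:

```python
import math
import numpy as np

grade_sample = np.array([6, 8, 5, 3, 7, 9, 8, 7, 6, 7])

population_std = grade_sample.std()          # divides by N (ddof=0 is the default)
sample_std = grade_sample.std(ddof=1)        # divides by N - 1

# The population formula written out by hand, for comparison.
sample_mean = sum(grade_sample) / len(grade_sample)
manual_std = math.sqrt(sum((x - sample_mean) ** 2 for x in grade_sample) / len(grade_sample))

print(population_std, manual_std, sample_std)
```

The sample version is always slightly larger than the population version for the same data.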

# Data Analysis



## Task 5: Explore and Discuss:
Work in pairs/groups. You are given data from an application log of a learning management system. Read the data into a pandas DataFrame and look at its structure to understand the format. Download the dataset here: https://studylens.science.uu.nl/media/ovp/userlog.csv

You will need to *upload* the dataset so that Jupyter Notebook can read it, or refer to its absolute path on your system.

You will then parse this file to retrieve details about the student that can help you summarize the student's use of the system in a much simpler way. Here are some ideas to get you started:
1. What does each column mean? 
1. What does each action indicate? How to interpret the target?
1. What can you extract from action and target? 
1. How would you process each row iteratively (one by one)?
1. How would you process a whole column with one line of code?

In [None]:
import pandas as pd
filepath = "C:\\Users\\Joshi008\\OneDrive - Universiteit Utrecht\\OVP Python 2024\\userlog.csv"
userlog = pd.read_csv(filepath)

Use the .head() and .tail() methods to take a look at the top and bottom rows of the dataset.

In [None]:
userlog.head()

You can explore the columns using the following:

In [None]:
# dtypes is a Series indexed by column name, so iterate over its (name, dtype) pairs
for column, dtype in userlog.dtypes.items():
    print(column, dtype)

What do these mean?

### Convert the timestamp to a readable format

The timestamp column has the dtype object. It will not be useful unless we convert it to date-time.

In [None]:
pd.to_datetime(userlog['timestamp'], format='%d-%m-%y %H:%M:%S %p')

In [None]:
userlog['timestamp'] = pd.to_datetime(userlog['timestamp'], format='%d-%m-%y %H:%M:%S %p')

In [None]:
userlog.dtypes
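
One detail of the format string is worth checking against the actual data. In Python's `datetime.strptime`, the `%p` (AM/PM) code only changes the parsed hour when the hour is read with `%I` (12-hour clock); with `%H` it is ignored. The sketch below shows this with the standard library on a made-up timestamp string. The real layout of `userlog.csv` is not shown here, so the sample value is an assumption.

```python
from datetime import datetime

raw = "02-01-24 03:15:00 PM"  # assumed sample value, not taken from userlog.csv

with_24h_code = datetime.strptime(raw, "%d-%m-%y %H:%M:%S %p")
with_12h_code = datetime.strptime(raw, "%d-%m-%y %I:%M:%S %p")

print(with_24h_code)  # hour stays 3: %p is ignored when the hour is parsed with %H
print(with_12h_code)  # hour becomes 15: %p is applied to a %I hour
```

If the file stores 12-hour times with an AM/PM suffix, `%I` may be the code to use in `pd.to_datetime` as well; comparing a few converted values with the raw strings is a quick way to confirm.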

The following lines can be used to explore the dataset row by row. It is an easy approach, but it can make your program run slowly on large datasets.

In [None]:
# for i, row in userlog.iterrows():
#     print (i, row['action'])
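
To compare the row-by-row approach with processing a whole column at once, here is a sketch on a tiny hand-made table. The column names `action` and `target` follow the dataset, but the values in `toy_log` are hypothetical and may differ from what `userlog.csv` contains.

```python
import pandas as pd

# Hypothetical rows; the real userlog.csv may use different values.
toy_log = pd.DataFrame({
    "action": ["login", "load", "load", "logout"],
    "target": ["", "article_12", "video_3", ""],
})

# Row by row: flexible, but slow on large tables.
load_count_loop = 0
for i, row in toy_log.iterrows():
    if row["action"] == "load":
        load_count_loop += 1

# Whole column at once: the comparison yields a boolean Series,
# and sum() counts the True values.
load_count_vectorized = (toy_log["action"] == "load").sum()

# value_counts() tallies every distinct action in one call.
action_counts = toy_log["action"].value_counts()

print(load_count_loop, load_count_vectorized)
print(action_counts)
```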

The following line extracts the item in the 3rd row and 2nd column (positions are counted from 0).

In [None]:
userlog.iloc[2, 1]

### What are all possible actions a user might have taken?

In [None]:
userlog.action

In [None]:
userlog.action.values

In [None]:
userlog.action.unique()

## Task 6: Data Preparation
Once you understand the basics, you will extract items like:
1. How many unique sessions did the student have?
1. What were the dates when the user was active? Can you find how many weekends they were active on?
1. Did the user load an activity multiple times? Which activities were loaded the most?
1. If you find what each "target" means, can you separate the number of loaded activities into articles read and videos watched?

Hint:

Pandas also allows using "dt" to access date time objects.
```
for date in userlog.timestamp.dt.date.unique():
    print (date, date.strftime('%A'))
```
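
Building on the hint, the `dt` accessor also exposes `dayofweek` (Monday is 0, Sunday is 6), which can be used to pick out weekend activity. The sketch below uses made-up timestamps in a Series named `toy_times`; note that it counts distinct weekend days, which is only one possible reading of "weekends active".

```python
import pandas as pd

# Hypothetical timestamps for illustration.
toy_times = pd.Series(pd.to_datetime([
    "2024-02-02 09:00:00",  # Friday
    "2024-02-03 10:30:00",  # Saturday
    "2024-02-03 18:45:00",  # Saturday
    "2024-02-04 11:15:00",  # Sunday
    "2024-02-05 08:00:00",  # Monday
]))

# Distinct calendar dates with any activity.
active_dates = toy_times.dt.date.unique()

# Keep only rows that fall on Saturday (5) or Sunday (6).
weekend_rows = toy_times[toy_times.dt.dayofweek >= 5]
weekend_dates = weekend_rows.dt.date.unique()

print(len(active_dates), "active days")
print([d.strftime("%A") for d in weekend_dates])
```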

# Loading Data In The Real World

In this problem, you will use an external API to prepare your dataset. 

Application Programming Interfaces (APIs) serve as intermediaries that enable communication and interaction between different software applications. APIs define a set of rules and protocols that allow one software system to request and exchange information with another. 

In Python, utilizing APIs involves sending HTTP requests to a specified endpoint, typically in the form of URLs, and receiving responses in a structured data format such as JSON or XML. To access data through an API, you would use Python libraries like requests to send HTTP requests and handle responses. Once data is retrieved, it can be processed using various Python libraries, such as pandas for data manipulation and analysis. 

The processing steps may include extracting relevant information, transforming data types, and cleaning the data. Finally, to prepare a single flat file, you can consolidate the processed data into a DataFrame using pandas and export it to a CSV or other flat file format, making it easily accessible for further analysis or sharing with other applications.
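
The steps described above (parse the response, transform data types, export a flat file) can be sketched without a network connection. The JSON text below is a hypothetical response body, and an in-memory buffer stands in for a file on disk.

```python
import io
import json
import pandas as pd

# Hypothetical API response body; a real API defines its own fields.
raw_json = '[{"id": 1, "name": "Ada", "score": "7.5"}, {"id": 2, "name": "Linus", "score": "6.0"}]'

records = json.loads(raw_json)                    # parse JSON text into lists and dicts
table = pd.DataFrame(records)                     # one row per record
table["score"] = table["score"].astype(float)     # transform a data type (text to number)

buffer = io.StringIO()                            # stands in for a file on disk
table.to_csv(buffer, index=False)                 # export as a flat CSV
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, `table.to_csv("some_name.csv", index=False)` would write the same content to disk.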

### Example of using an API

Before we get deeper into our problem, let's begin with a simple API. Take a look at https://restcountries.com/

**Requests Module in Python:**

The `requests` module in Python is a versatile HTTP library that simplifies the process of making HTTP requests. It provides a convenient API for sending HTTP/1.1 requests, handling response content, and managing various aspects of HTTP communication. The module is widely used for interacting with web APIs and fetching data from the internet. Its simplicity and readability make it a popular choice for HTTP-related tasks. 

`Requests` eliminates the need for manually handling low-level HTTP operations and provides a high-level interface for common use cases, such as sending GET and POST requests. 

The alternative to `requests` includes lower-level libraries like `urllib`, but `requests` stands out for its user-friendly syntax and comprehensive functionality. In the background, the `requests` module abstracts the complexities of building and sending HTTP requests, managing sessions, and handling cookies.


Usually, you can simply send a GET or a POST request with the following syntax.
```
import requests
response = requests.get("http://apiendpoint.com?param=123")
```
Then you can process the response to use the data fetched from the API.
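
Query parameters do not have to be typed into the URL by hand; `requests` accepts them as a dictionary through the `params` argument. The sketch below builds a request without sending it, so you can see the final URL. The endpoint `https://example.com/api` and the parameter names are placeholders.

```python
import requests

# Placeholder endpoint and parameters, for illustration only.
base_url = "https://example.com/api"
query = {"param": 123, "format": "json"}

# Build the request without sending it, to inspect the URL that would be used.
prepared = requests.Request("GET", base_url, params=query).prepare()
print(prepared.method, prepared.url)
```

Calling `requests.get(base_url, params=query, timeout=10)` would send the same request; the `timeout` value here is an arbitrary choice that stops the call from waiting indefinitely.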

### Example: Get data about a country from the API

In [None]:
import requests

# API endpoint for REST Countries
api_endpoint = "https://restcountries.com/v3.1/name/"

# Country name (replace with the desired country)
country_name = "netherlands"

# Make a request to the API
response = requests.get(f"{api_endpoint}{country_name}")

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the JSON response
    country_data = response.json()

    # Display information about the country
    print("Country Information:")
    print("Name:", country_data[0]['name']['common'])
    print("Capital:", country_data[0]['capital'][0])
    print("Population:", country_data[0]['population'])
    print("Region:", country_data[0]['region'])
else:
    print(f"Error: Unable to retrieve information. Status code: {response.status_code}")

# API for Student Information

The API provided by the learning system can be used for obtaining the data. In more complex systems, you will have to register with the data provider and obtain the API key. We have set up an API key for you. 
```
KEY = '20240202OVP'
APP = 'ovpexample'
```
You will provide these along with your request.

How the API works:

First you will obtain a list of students. This API key authorizes you to access data about a limited list of students. No personal data is shared.

When working on large projects, you will upload your code to a shared repository, which in some cases might be public. In that situation, store KEY and APP in a separate file or as environment variables rather than in the code. For this example, you don't have to worry about that. This API will be disabled after the course.
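
As a sketch of the environment-variable approach, the credentials can be read with `os.environ`. The variable names `OVP_API_KEY` and `OVP_API_APP` are hypothetical; use whatever names you set on your own system.

```python
import os

# Hypothetical variable names; returns "" when a variable is not set.
api_key_from_env = os.environ.get("OVP_API_KEY", "")
api_app_from_env = os.environ.get("OVP_API_APP", "")

if not api_key_from_env or not api_app_from_env:
    print("Set OVP_API_KEY and OVP_API_APP before sending requests.")

# The same parameter names that the API expects in the request.
request_params = {"appid": api_app_from_env, "key": api_key_from_env}
```

This keeps the credentials out of the notebook itself, so the code can be shared without exposing them.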

In [None]:
import requests

api_key = 'YOUR_API_KEY'
api_app = 'YOUR_APP'

api_endpoint = 'https://studylens.science.uu.nl/ovp/'
params = {'appid': api_app, 'key': api_key}

response = requests.get(api_endpoint, params=params)

if response.status_code == 200:
    server_response_data = response.json()
    print (server_response_data)
else:
    print(f"Error: Unable to retrieve data. Status code: {response.status_code}")

Here is the API specification defined by this software.

**Get Student List** `/ovp/students?courseid=401`
Get a list of all the students in the course assigned to courseid 401. The response will contain a list of student ids.

**Get Student Details** `/ovp/studentrecord?studentid=<STUDENTID>`
Get a JSON object with all the details about a student's activities in the system.

### Testing the status of the API

In [None]:
import requests
response = requests.get('https://studylens.science.uu.nl/ovp/', params=params)

if response.status_code == 200:
    server_response_data = response.json()
    print (server_response_data)
else:
    print(f"Error: Unable to retrieve data. Status code: {response.status_code}")

## API Parameters
Run the following block of code to prepare the variables we will need to send to the API.

In [None]:
API_ENDPOINT = 'https://studylens.science.uu.nl/ovp/'
KEY = ''
APP = ''
COURSEID = '2024214'

### Task 6.1: Get a list of all the students in this course

Fetch a list of all the student ids present in this course. Save it to a list with variable name ```studentids```

In [None]:
params = {'appid': APP, 'key': KEY, 'courseid':COURSEID}
url = API_ENDPOINT + "students"  # API_ENDPOINT already ends with "/"
print (url)

response = requests.get(url, params=params)

print (response.status_code)
print (response.json())

studentids = # write your solution here

### Example: First Get details of one student

Pick a student id, say, 1167799. We will first make it work for one student. Once that works, we can write a loop to fetch the details of all the students, collect them in a list, and convert it to a DataFrame.

In [None]:
url = API_ENDPOINT + "getstudentrecord"  # API_ENDPOINT already ends with "/"
params = {'appid': APP, 'key': KEY, 'studentid':'1167799'}
response = requests.get(url, params=params)
response.json()

### Task 6.2: Collect details of all the students in the course

In [None]:
url = api_endpoint+"getstudentrecord"

student_data = []
for student in studentids:
    params = {'appid': APP, 'key': KEY, 'studentid':str(student)}
    response = requests.get(url, params=params)
    
    student_data.append( # write your solution here )

### Convert to a dataframe and save as a CSV file

It is time-consuming to call an API whenever you need a piece of data, so it is better not to request the same data item multiple times. Some APIs limit the number of requests you can send per hour, or each request might cost something. In some government applications, if you are accessing sensitive data, your API key may be valid only for a short amount of time.
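
One simple way to avoid repeated requests is to save each response to a local file and read it back on later runs. The sketch below shows the idea with a hypothetical helper `fetch_with_cache` and a stand-in function `fake_fetch` in place of a real API call; it is an illustration, not part of the course API.

```python
import json
import tempfile
from pathlib import Path

def fetch_with_cache(cache_path, fetch):
    """Return cached JSON data if the file exists; otherwise call fetch() once and store the result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch()
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data

# Stand-in for a real API call; it records how often it was invoked.
calls = []
def fake_fetch():
    calls.append(1)
    return {"studentid": "0000", "sessions": 3}

cache_file = Path(tempfile.mkdtemp()) / "student_0000.json"
first = fetch_with_cache(cache_file, fake_fetch)    # calls fake_fetch and writes the file
second = fetch_with_cache(cache_file, fake_fetch)   # reads the file, no second call
print(len(calls), first == second)
```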

Let's convert all the data to a DataFrame that can be stored as a CSV file for future use.

In [None]:
import pandas as pd
df = pd.DataFrame(student_data, columns = ['UserID','FirstLoginDate','UnqDay','UnqWeeknd','NumSessions','UnqArticlesRead','UnqVideosWatched','QuizAttempted','QuizReviewed','ArticlesVisited','VideosWatched','Quiz1','Quiz2','Quiz3','Quiz4','Quiz5','Quiz6','Quiz7','Quiz8','Quiz9','Quiz10','AvgQuizGrade','CourseGrade','FinalResult'])
df.to_csv('mydata.csv', index=False)  # index=False avoids saving the row index as an extra column

In [None]:
df = pd.read_csv('mydata.csv')

# Visualizing the Dataset

Visualizing and graphing data is an integral part of data analysis. Visual cues often convey information better than text, and a single visual can carry more information. Visualizing data is useful because it lets us see relationships in data quickly and intuitively. It is especially helpful when exploring data and deciding what to dig into next, because it can point to places where significant patterns might be present.

EDA or `Exploratory Data Analysis` is the process of critically analysing data so as to dig out hidden and meaningful insights.
- Understanding and gaining insights into the data before moving on to building a model are important tasks.
- Understanding each variable and their relationships is also necessary. 

The following are the major steps:
- Visualize the data and gain simple insights.
- Perform simple descriptive statistics.
- Perform univariate analysis for each variable.
- Create derived variables and metrics if necessary.
- Find the correlation between the variables.


## Univariate Analysis

Univariate analysis is the simplest form of data analysis where the data being analyzed contains only one variable. Since it's a single variable it doesn’t deal with causes or relationships.  The main purpose of univariate analysis is to describe the data and find patterns that exist within it.

You can think of the variable as a category that your data falls into. One example of a variable in univariate analysis might be "age". Another might be "height". Univariate analysis would not look at these two variables at the same time, nor would it look at the relationship between them.

Some ways you can describe patterns found in univariate data include looking at mean, mode, median, range, variance, maximum, minimum, quartiles, and standard deviation. Additionally, some ways you may display univariate data include frequency distribution tables, bar charts, histograms, frequency polygons, and pie charts.
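
Several of these descriptive measures are available in Python's standard `statistics` module. The sketch below applies them to the list of ten grades used earlier, stored here under the name `grade_list`.

```python
import statistics

grade_list = [6, 8, 5, 3, 7, 9, 8, 7, 6, 7]

summary = {
    "mean": statistics.mean(grade_list),
    "median": statistics.median(grade_list),
    "mode": statistics.mode(grade_list),
    "minimum": min(grade_list),
    "maximum": max(grade_list),
    "range": max(grade_list) - min(grade_list),
    "variance": statistics.pvariance(grade_list),   # population variance
    "std_dev": statistics.pstdev(grade_list),       # population standard deviation
}

for label, value in summary.items():
    print(label, value)
```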


## Bivariate Analysis
Bivariate analysis is used to find out if there is a relationship between two different variables. Something as simple as creating a scatterplot by plotting one variable against another on a Cartesian plane (think X and Y axis) can sometimes give you a picture of what the data is trying to tell you. If the data seems to fit a line or curve then there is a relationship or correlation between the two variables.  For example, one might choose to plot caloric intake versus weight.
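
A common numeric companion to the scatterplot is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand with a hypothetical helper `pearson_r`. The calorie and weight numbers are invented and deliberately lie on a straight line, so they are not real measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)

# Invented numbers for illustration only.
calories = [1800, 2000, 2200, 2400, 2600]
weights = [60, 64, 68, 72, 76]

r_linear = pearson_r(calories, weights)            # points on a rising line
r_reversed = pearson_r(calories, weights[::-1])    # points on a falling line
print(r_linear, r_reversed)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship. A high correlation on its own does not show that one variable causes the other.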

## Multivariate Analysis
Multivariate analysis is the analysis of three or more variables.  There are many ways to perform multivariate analysis depending on your goals.

This is relatively more complex and requires advanced understanding of statistics and data analysis methods.

## Univariate Analysis vs Bivariate Analysis
| **Univariate Analysis**                           | **Bivariate Analysis**                                           |
|-----------------------------------------------|--------------------------------------------------------------|
| Involves a single variable                    | Involves two variables                                       |
| Deals with intrinsic property of the data     | Deals with cause and relationships between the two variables |
| Major purpose is to describe                  | Major purpose is to explain                                  |
| Mean, Median, Mode, Range, Standard Deviation | Correlation, Relationships, Causal Explanations              |

Classical statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples. In 1962, John W. Tukey called for a reformation of statistics in his seminal paper “The Future of Data Analysis”. He proposed a new scientific discipline called data analysis that included statistical inference as just one component. Tukey forged links to the engineering and computer science communities (he coined the terms bit, short for binary digit, and software), and his original tenets are surprisingly durable and form part of the foundation for data science. The field of exploratory data analysis was established with Tukey’s now-classic 1977 book Exploratory Data Analysis.

People are not very good at looking at a column of numbers or a whole spreadsheet and then determining important characteristics of the data. They find looking at numbers to be tedious, boring, and/or overwhelming. Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate). Each of the categories of EDA has further divisions based on the role (outcome or explanatory) and type (categorical or quantitative) of the variable(s) being examined.

# Understanding the Data

In [None]:
import matplotlib.pyplot as plt  # used by the boxplot cell below
import seaborn as sns
import pandas as pd

df = pd.read_csv('mydata.csv')

### How is Course Grade Distributed

In [None]:
sns.histplot(df['CourseGrade'], color='Green')

### How are the number of sessions Distributed

In [None]:
sns.histplot(df['NumSessions'], color='Green')

### Boxplot - another look at the distribution

In [None]:
plt.figure(figsize=(12,10))
sns.boxplot(y='NumSessions',data=df)
plt.xticks(rotation=45)
plt.show()


### Scatterplot - relationship between two variables

Does the number of times (sessions) a student logged in affect their final course grade?

In [None]:
sns.scatterplot(data = df, x='NumSessions', y='CourseGrade')

### Heatmap - are some variables correlated with each other?

In [None]:

sns.heatmap(df[["UnqDay","UnqWeeknd","NumSessions","UnqArticlesRead","UnqVideosWatched","QuizAttempted","QuizReviewed"]].corr(), cmap='Blues', annot =True )