# Some Python Basics

## Variables

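Variables in Python are dynamically typed: a name has no declared type, and the type belongs to whatever value the name currently refers to. This is related to, but not the same as, duck typing, where code cares about what an object can do rather than what class it is (if it acts like a duck and looks like a duck, it's a duck).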


In [42]:
x1 = 5
x2 = 5.0
x3 = "5"

print(type(x1), type(x2), type(x3))

<class 'int'> <class 'float'> <class 'str'>


In [43]:
x4 = int(x3)

print(type(x4))

<class 'int'>


To get more information on a variable's type, you can use the `type()` function.
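As a small illustration, the sketch below compares `type()` with the built-in `isinstance()`, which these notes do not otherwise cover. The variable names are made up for the example.

```python
# Hypothetical values for illustration
count = 5
price = 5.0
label = "5"

# type() returns the exact class of the value
count_type = type(count)

# isinstance() checks a value against a class (or a tuple of classes)
price_is_number = isinstance(price, (int, float))
label_is_number = isinstance(label, (int, float))

# A name is not tied to a type: it can be rebound to a value of another type
count = str(count)

print(count_type, price_is_number, label_is_number, type(count))
```

`isinstance()` is usually the better fit when several types are acceptable, while `type()` is handy for quick inspection.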

We can also get input from users to fill our variables.

In [None]:
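# input() always returns a string
x6 = input("Do you like sushi? ")
# Add spaces around the answer so the words do not run together
print('There is ' + x6 + ' in the rec room')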




## Booleans

Booleans can be handy when working with DataFrames, as we will see later. You can also do arithmetic with them: `False` is treated as 0 and `True` as 1.

In [None]:
x7 = True
x8 = False

print(x7 + x8)
print(x7 * x8)
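One practical use of this behavior is counting. The sketch below uses a made-up list of answers and assumes nothing beyond built-in Python.

```python
# Hypothetical survey answers for illustration
answers = [True, False, True, True]

# True counts as 1 and False as 0, so sum() counts the True values
yes_count = sum(answers)

# Dividing by the length gives the fraction of True values
yes_fraction = yes_count / len(answers)

print(yes_count, yes_fraction)
```

The same idea is what makes expressions like `cyber_df.isnull().sum()` count missing values per column.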

## Strings
Strings in Python are created with ' or " and are immutable: if a change needs to be made to a string, a new string is returned. Python 3 strings are sequences of Unicode characters, and UTF-8 is the default encoding for source files and for `str.encode()`, so text in different languages works without extra setup. Python strings work similarly to C++ STL strings in that they are objects with built-in methods, but Python offers far more built-in functionality.

In [None]:
x9 = 'Jacob Cotham'
x10 = 'jacob cotham'

x11 = x9.lower()

print(x11)
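To make the immutability point concrete, here is a minimal sketch. It reuses the name from the cell above; the other variable names are only for illustration.

```python
name = 'Jacob Cotham'

# String methods return a new string and leave the original unchanged
shouted = name.upper()

# Item assignment is not supported on str, so this raises TypeError
try:
    name[0] = 'j'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# To "change" a string, build a new one from pieces
fixed = 'j' + name[1:]

print(name, shouted, assignment_failed, fixed)
```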


## Lists
In C++, choosing which container to use is very important (list, queue, stack, vector, array?). In Python, much of this choice is simplified into a single general-purpose container, the list, which covers most of these uses. To create a list, use square brackets `[]`. Notice that the element types don't have to match; a list can hold values of any type.

In [None]:
list1 = [1,2, False, 4.0]
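The sketch below shows a few common list operations, including how one list can stand in for a stack or a queue. The variable names are illustrative only.

```python
mixed = [1, 2, False, 4.0]

# Indexing and slicing
first = mixed[0]
last_two = mixed[-2:]

# Lists are mutable: append adds to the end, pop removes and returns the last item
mixed.append("five")
popped = mixed.pop()

# Stack-like use: push with append, pop from the end
stack = [1, 2, 3]
top = stack.pop()

# Queue-like use: pop from the front (this is O(n); collections.deque is faster for large queues)
queue = [1, 2, 3]
front = queue.pop(0)

print(first, last_two, popped, top, front)
```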

# Pandas
In the last lesson, we got to see Pandas in action by using it to make some visualizations of the data science salaries data.  Let's take some time to explore some of the cool features of Python and Pandas.

## The History of Pandas

Origins:

* 2008: The Pandas project was started by Wes McKinney when he was working at AQR Capital Management. The main motivation was to have a flexible tool to perform quantitative analysis on financial data. The name "pandas" is derived from the term "panel data," an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Early Development:

* 2009: Wes McKinney released the first public version of pandas. The initial versions laid the foundation with data structures like Series and DataFrame, which have since become staples for data manipulation in Python.

Increasing Adoption:

* 2010s: As data science and Python grew in popularity during the 2010s, so did pandas. It quickly became one of the cornerstones of the scientific stack in Python alongside libraries like NumPy, SciPy, and Matplotlib.
The library received significant contributions from many developers worldwide, enhancing its capabilities and making it more robust.

Books and Documentation:

* 2012: Wes McKinney published "Python for Data Analysis," which prominently features pandas and its application in data analysis. This book played a crucial role in introducing many individuals to pandas and data analysis in Python.


Pandas is often seen as a gateway to data science in Python. Its simple yet powerful interface makes it a favorite for beginners and professionals alike.
Pandas also works alongside larger-scale tools such as Apache Spark, Dask, and Vaex, which offer DataFrame-style interfaces, so users can scale their analyses when necessary.

## DataFrames and Series

The DataFrame is the primary structure we will be using for this class. It is a labeled, two-dimensional data structure: imagine a spreadsheet page, SQL table, or flat file. A Series is a one-dimensional data structure and can be thought of as a single column of a DataFrame.

We can manually create a DataFrame from dictionaries, lists, Series, and more. We can also add new features to a DataFrame, or even combine multiple DataFrames. If our data is provided to us, we can read from or write to a variety of formats and sources: CSV, Excel, SQL, JSON, a URL, the clipboard, etc.
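As a minimal sketch of these ideas, the example below builds a DataFrame from a dictionary using a few values copied from the cybersecurity output shown later in these notes. The `Is Recent` column is a hypothetical feature added only for illustration.

```python
import pandas as pd

# Build a DataFrame from a dictionary of column name -> list of values
df = pd.DataFrame({
    'Country': ['China', 'India', 'France'],
    'Year': [2019, 2019, 2023],
    'Attack Type': ['Phishing', 'Ransomware', 'Ransomware'],
})

# Selecting one column returns a Series
years = df['Year']

# Selecting with a list of column names returns a DataFrame
subset = df[['Country', 'Year']]

# Add a new feature (hypothetical column name)
df['Is Recent'] = df['Year'] >= 2020

print(type(years).__name__, type(subset).__name__, df.shape)
```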

## Common useful Pandas methods

### DataFrame Creation and Input/Output
- `pd.DataFrame()`: Create a DataFrame.
- `pd.read_csv()`: Read a CSV file into a DataFrame.
- `pd.read_excel()`: Read an Excel file into a DataFrame.
- `df.to_csv()`: Write a DataFrame to a CSV file.
- `df.to_excel()`: Write a DataFrame to an Excel file.
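The following sketch round-trips a small DataFrame through CSV text. It uses `io.StringIO` in place of a file on disk so that it runs anywhere; the data is a tiny made-up sample.

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['China', 'India'], 'Year': [2019, 2019]})

# With no path, to_csv() returns the CSV text; index=False leaves out the row labels
csv_text = df.to_csv(index=False)

# read_csv() accepts a path, a URL, or a file-like object
round_trip = pd.read_csv(io.StringIO(csv_text))

# Writing with the default index=True and reading back adds a column named 'Unnamed: 0'
with_index = pd.read_csv(io.StringIO(df.to_csv()))

print(round_trip.columns.tolist(), with_index.columns.tolist())
```

Passing `index=False` when writing is one way to avoid carrying an extra unnamed column into later reads.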

### Viewing and Inspecting Data
- `df.head()`: View the first few rows of the DataFrame.
- `df.tail()`: View the last few rows of the DataFrame.
- `df.info()`: Get a concise summary of the DataFrame.
- `df.describe()`: Generate descriptive statistics.
- `df.shape`: Get the dimensions of the DataFrame.
- `df.columns`: Get the column labels.
- `df.index`: Get the row labels.

### Selection and Filtering
- `df.loc[]`: Access a group of rows and columns by labels.
- `df.iloc[]`: Access a group of rows and columns by integer position.
- `df[df['column'] > value]`: Filter rows based on column values.
- `df.query()`: Query the DataFrame with a boolean expression.
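The selection methods above can be compared side by side on a small sample. The values are copied from the education output later in these notes; the short column name `Loss` is an assumption made to keep the example compact.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Country': ['China', 'India', 'France', 'Japan'],
        'Loss': [80.53, 30.56, 17.72, 53.04],
    },
    index=[0, 12, 21, 26],
)

by_label = df.loc[12, 'Country']    # row with label 12
by_position = df.iloc[1, 0]         # second row, first column
over_50 = df[df['Loss'] > 50]       # boolean mask
queried = df.query('Loss > 50')     # same filter written as a string expression

print(by_label, by_position, over_50.index.tolist())
```

Note that `loc` uses the index labels (here 0, 12, 21, 26) while `iloc` counts positions from zero.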

### Grouping and Aggregation
- `df.groupby()`: Group data by one or more columns.
- `df.agg()`: Aggregate using one or more operations over the specified axis.
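- `df.groupby().size()`: Count the number of rows in each group. (On a plain DataFrame, `df.size` is an attribute giving the total number of elements.)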
- `df.sum()`: Compute the sum of values.
- `df.mean()`: Compute the mean of values.
- `df.median()`: Compute the median of values.
- `df.min()`: Compute the minimum of values.
- `df.max()`: Compute the maximum of values.
- `df.count()`: Count the number of non-NA/null observations.
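A minimal sketch of grouping and aggregating follows, using a handful of values copied from the education output later in these notes. The short column name `Loss` is an assumption for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'Attack Type': ['Phishing', 'Ransomware', 'Ransomware', 'Malware'],
    'Loss': [80.53, 30.56, 17.72, 53.04],
})

grouped = df.groupby('Attack Type')

# Number of rows in each group
group_sizes = grouped.size()

# Several aggregations at once on one column
summary = grouped['Loss'].agg(['mean', 'max'])

print(group_sizes)
print(summary)
```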

### Data Cleaning and Preparation
- `df.drop()`: Drop specified labels from rows or columns.
- `df.dropna()`: Remove missing values.
- `df.fillna()`: Fill missing values.
- `df.replace()`: Replace values.
- `df.rename()`: Rename labels.
- `df.astype()`: Cast a pandas object to a specified dtype.
- `df.sort_values()`: Sort by the values along either axis.
- `df.sort_index()`: Sort by the index.
- `df.set_index()`: Set the DataFrame index using existing columns.
- `df.reset_index()`: Reset the index, or a level of it.
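Several of the cleaning methods above chain together naturally. The sketch below uses a made-up messy table (the missing values and lowercase column names are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: a missing country, losses stored as text, a missing user count
raw = pd.DataFrame({
    'country': ['China', 'India', None],
    'loss': ['80.53', '30.56', '17.72'],
    'users': [773169, np.nan, 261661],
})

cleaned = (
    raw
    .rename(columns={'country': 'Country', 'loss': 'Loss', 'users': 'Users'})
    .astype({'Loss': float})        # text -> float
    .dropna(subset=['Country'])     # drop rows with no country
    .fillna({'Users': 0})           # fill remaining gaps in Users
    .sort_values('Loss')
    .reset_index(drop=True)
)

print(cleaned)
```

Whether to drop or fill missing values depends on the analysis; filling with 0 here is only a placeholder choice.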

### Merging and Joining
- `pd.merge()`: Merge DataFrame objects by performing a database-style join.
- `df.join()`: Join columns with other DataFrame.
- `pd.concat()`: Concatenate pandas objects along a particular axis.
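As a rough illustration of merging and concatenating, the sketch below joins a small attacks table to a hypothetical `regions` lookup table that is not part of the course data.

```python
import pandas as pd

attacks = pd.DataFrame({
    'Country': ['China', 'India', 'France'],
    'Year': [2019, 2019, 2023],
})

# Hypothetical lookup table, invented for this example
regions = pd.DataFrame({
    'Country': ['China', 'India'],
    'Region': ['Asia', 'Asia'],
})

# Left join keeps every row of attacks; unmatched rows get NaN in Region
merged = pd.merge(attacks, regions, on='Country', how='left')

# Stack two DataFrames with the same columns on top of each other
stacked = pd.concat([attacks, attacks], ignore_index=True)

print(merged)
print(len(stacked))
```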

### Date and Time
- `pd.to_datetime()`: Convert argument to datetime.
- `df['column'].dt`: Accessor object for datetime-like properties.
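A short sketch of the datetime tools follows. The `reported` column and its dates are hypothetical, since the cybersecurity data shown in these notes only carries a year.

```python
import pandas as pd

# Hypothetical date strings for illustration
df = pd.DataFrame({'reported': ['2019-03-15', '2023-11-02']})

# Convert text to datetime values
df['reported'] = pd.to_datetime(df['reported'])

# The .dt accessor exposes datetime parts
df['year'] = df['reported'].dt.year
df['weekday'] = df['reported'].dt.day_name()

print(df)
```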

### String Methods
- `df['column'].str`: Accessor object for string methods.
- `df['column'].str.contains()`: Test if pattern or regex is contained within a string of a Series or Index.
- `df['column'].str.replace()`: Replace occurrences of pattern/regex/string with some other string.
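The `.str` accessor applies string operations to every value in a column. The sketch below uses attack type names that appear in the data later in these notes.

```python
import pandas as pd

df = pd.DataFrame({'Attack Type': ['Phishing', 'SQL Injection', 'Man-in-the-Middle']})

# True where the text contains 'SQL'
is_sql = df['Attack Type'].str.contains('SQL')

# Plain (non-regex) replacement of hyphens with spaces
no_hyphens = df['Attack Type'].str.replace('-', ' ', regex=False)

# Ordinary string methods are available too
lowered = df['Attack Type'].str.lower()

print(is_sql.tolist(), no_hyphens.tolist(), lowered.tolist())
```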

### Statistical Functions
- `df.corr()`: Compute pairwise correlation of columns.
- `df.cov()`: Compute pairwise covariance of columns.
- `df.var()`: Compute variance of columns.
- `df.std()`: Compute standard deviation of columns.
- `df.mad()`: Compute mean absolute deviation of columns (removed in pandas 2.0; use `(df - df.mean()).abs().mean()` instead).
- `df.kurt()`: Compute kurtosis of columns.
- `df.skew()`: Compute skewness of columns.
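Here is a small sketch of a few of these statistics on made-up numbers, including a manual mean absolute deviation for pandas versions where `mad()` is not available.

```python
import pandas as pd

# Hypothetical numbers chosen so the results are easy to check by hand
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0], 'y': [2.0, 4.0, 6.0, 8.0]})

# y is exactly 2 * x, so the correlation is 1
xy_corr = df.corr().loc['x', 'y']

# Sample standard deviation (pandas divides by n - 1 by default)
x_std = df['x'].std()

# Mean absolute deviation computed by hand
x_mad = (df['x'] - df['x'].mean()).abs().mean()

print(xy_corr, x_std, x_mad)
```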

### Visualization
- `df.plot()`: Make plots of DataFrame using matplotlib.

### Miscellaneous
- `df.pivot()`: Reshape the DataFrame using three columns as the new index, columns, and values (no aggregation).
- `df.pivot_table()`: Create a spreadsheet-style pivot table as a DataFrame.
- `df.apply()`: Apply a function along an axis of the DataFrame.
- `df.applymap()`: Apply a function to a DataFrame elementwise (deprecated since pandas 2.1 in favor of `df.map()`).
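The sketch below illustrates `pivot_table()` and `apply()` on a few values copied from the education output later in these notes. The short column name `Loss` and the `label` column are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2019, 2019, 2022, 2022],
    'Attack Type': ['Phishing', 'Ransomware', 'Malware', 'SQL Injection'],
    'Loss': [80.53, 30.56, 53.04, 66.24],
})

# Average loss per year
table = df.pivot_table(index='Year', values='Loss', aggfunc='mean')

# apply() on a Series calls the function once per value
rounded = df['Loss'].apply(round)

# apply() with axis=1 calls the function once per row
df['label'] = df.apply(lambda row: f"{row['Attack Type']} ({row['Year']})", axis=1)

print(table)
print(rounded.tolist(), df['label'].tolist())
```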


## Data wrangling

Let's explore the cybersecurity threat data using Pandas methods.

In [None]:
import pandas as pd

cyber_df = pd.read_csv('assets/Cybersecurity.csv')

cyber_df.head()

In [None]:
cyber_df.tail()

Let's look at the structure of the data: column names, data types, and non-null counts.

In [None]:
cyber_df.info()

In [None]:
cyber_df.describe()


In [None]:
cyber_df.describe(include='object')

In [None]:
cyber_df.shape


In [None]:
cyber_df['Attack Type'].value_counts()


In [None]:
cyber_df[ ['Attack Type', 'Target Industry'] ].value_counts()


In [None]:
cyber_df.isnull().sum()


In [None]:
titanic_df = pd.read_csv('assets/titanic_passengers.csv')

titanic_df.head()

In [None]:
titanic_df.isnull().sum()

Let's create a subset that just includes attacks in education.

In [None]:
education_df = cyber_df.loc[ cyber_df['Target Industry'] == 'Education']

education_df.head()

| Unnamed: 0 | Country | Year | Attack Type | Target Industry | Financial Loss (in Million $) | Number of Affected Users | Attack Source | Security Vulnerability Type | Defense Mechanism Used | Incident Resolution Time (in Hours) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | 2019 | Phishing | Education | 80.53 | 773169 | Hacker Group | Unpatched Software | VPN | 63 |
| 12 | India | 2019 | Ransomware | Education | 30.56 | 583204 | Insider | Zero-day | Firewall | 37 |
| 21 | France | 2023 | Ransomware | Education | 17.72 | 261661 | Insider | Social Engineering | VPN | 11 |
| 26 | Japan | 2022 | Malware | Education | 53.04 | 570494 | Nation-state | Unpatched Software | VPN | 53 |
| 30 | UK | 2022 | SQL Injection | Education | 66.24 | 678876 | Hacker Group | Social Engineering | AI-based Detection | 11 |


In [36]:
education_df['Target Industry'].value_counts()

Target Industry
Education    419
Name: count, dtype: int64

Let's make a subset of Education attacks with more than 50 million in loss. Both conditions must hold, so we combine them with `&` (AND).

In [39]:
expensive_education_df = cyber_df.loc[ (cyber_df['Target Industry'] == 'Education') & 
                                        (cyber_df['Financial Loss (in Million $)'] > 50) ]
expensive_education_df.head()



| Unnamed: 0 | Country | Year | Attack Type | Target Industry | Financial Loss (in Million $) | Number of Affected Users | Attack Source | Security Vulnerability Type | Defense Mechanism Used | Incident Resolution Time (in Hours) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | 2019 | Phishing | Education | 80.53 | 773169 | Hacker Group | Unpatched Software | VPN | 63 |
| 26 | Japan | 2022 | Malware | Education | 53.04 | 570494 | Nation-state | Unpatched Software | VPN | 53 |
| 30 | UK | 2022 | SQL Injection | Education | 66.24 | 678876 | Hacker Group | Social Engineering | AI-based Detection | 11 |
| 39 | Brazil | 2016 | DDoS | Education | 96.98 | 140812 | Nation-state | Unpatched Software | VPN | 71 |
| 59 | Australia | 2024 | Phishing | Education | 93.32 | 93185 | Unknown | Unpatched Software | VPN | 14 |


In [40]:
# describe() must be called with parentheses; without them, Python shows the bound method instead of the statistics
expensive_education_df['Financial Loss (in Million $)'].describe()

The subset contains 200 rows.

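Let's make a subset of attacks that are in Education OR have more than 50 million in loss. Either condition is enough, so we combine them with `|` (OR).

In [41]: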
expensive_or_education_df = cyber_df.loc[ (cyber_df['Target Industry'] == 'Education') | 
                                        (cyber_df['Financial Loss (in Million $)'] > 50) ]
expensive_or_education_df.head()

| Unnamed: 0 | Country | Year | Attack Type | Target Industry | Financial Loss (in Million $) | Number of Affected Users | Attack Source | Security Vulnerability Type | Defense Mechanism Used | Incident Resolution Time (in Hours) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | 2019 | Phishing | Education | 80.53 | 773169 | Hacker Group | Unpatched Software | VPN | 63 |
| 1 | China | 2019 | Ransomware | Retail | 62.19 | 295961 | Hacker Group | Unpatched Software | Firewall | 71 |
| 4 | Germany | 2018 | Man-in-the-Middle | IT | 74.41 | 810682 | Insider | Social Engineering | VPN | 68 |
| 5 | Germany | 2017 | Man-in-the-Middle | Retail | 98.24 | 285201 | Unknown | Social Engineering | Antivirus | 25 |
| 7 | France | 2018 | SQL Injection | Government | 59.23 | 909991 | Unknown | Social Engineering | Antivirus | 66 |
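To see how the AND and OR filters differ, the sketch below applies both to a tiny sample. Four rows are copied from the outputs above; the last row (index 99) is hypothetical and was added so that one row matches neither condition.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        'Target Industry': ['Education', 'Retail', 'Education', 'IT', 'Retail'],
        'Financial Loss (in Million $)': [80.53, 62.19, 30.56, 74.41, 12.00],
    },
    index=[0, 1, 12, 4, 99],  # row 99 is a made-up example
)

is_education = sample['Target Industry'] == 'Education'
is_expensive = sample['Financial Loss (in Million $)'] > 50

# Each condition needs its own parentheses when written inline,
# because & and | bind more tightly than == and >
both = sample.loc[is_education & is_expensive]        # AND
either = sample.loc[is_education | is_expensive]      # OR
neither = sample.loc[~(is_education | is_expensive)]  # NOT

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```

Naming the masks first, as done here, can make longer filters easier to read than nesting everything inside `loc[]`.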


Let's create a new feature where we categorize incident resolution time as > 24 hours or <= 24 hours.
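One possible approach is sketched below on five resolution times copied from the education output above. The new column name `Resolution Category` and its labels are assumptions, and `np.where` is just one of several ways to do this.

```python
import numpy as np
import pandas as pd

col = 'Incident Resolution Time (in Hours)'

# Sample values taken from the education output above
sample = pd.DataFrame({col: [63, 37, 11, 53, 11]}, index=[0, 12, 21, 26, 30])

# np.where picks the first label where the condition is True, else the second
sample['Resolution Category'] = np.where(sample[col] > 24, '> 24 hours', '<= 24 hours')

print(sample)
```

On the full data, the same line would be applied to `cyber_df` instead of `sample`.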

Calculate the average number of affected users by attack type.
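A minimal sketch of this calculation, using five rows copied from the education output above rather than the full file:

```python
import pandas as pd

# Sample values taken from the education output above
sample = pd.DataFrame({
    'Attack Type': ['Phishing', 'Ransomware', 'Ransomware', 'Malware', 'SQL Injection'],
    'Number of Affected Users': [773169, 583204, 261661, 570494, 678876],
})

# Group by attack type, then average the affected-user counts in each group
avg_users = sample.groupby('Attack Type')['Number of Affected Users'].mean()

print(avg_users)
```

Replacing `sample` with `cyber_df` would give the averages over the whole dataset.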

Calculate the average financial loss for each security vulnerability type.
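This could be sketched as a group-by followed by a sort, again on five rows copied from the education output above. The output column names `avg_loss` and `n_attacks` are hypothetical.

```python
import pandas as pd

# Sample values taken from the education output above
sample = pd.DataFrame({
    'Security Vulnerability Type': [
        'Unpatched Software', 'Zero-day', 'Social Engineering',
        'Unpatched Software', 'Social Engineering',
    ],
    'Financial Loss (in Million $)': [80.53, 30.56, 17.72, 53.04, 66.24],
})

# Named aggregation: new column name = (source column, function)
loss_by_vuln = (
    sample
    .groupby('Security Vulnerability Type')
    .agg(
        avg_loss=('Financial Loss (in Million $)', 'mean'),
        n_attacks=('Financial Loss (in Million $)', 'count'),
    )
    .sort_values('avg_loss', ascending=False)
)

print(loss_by_vuln)
```

Including the count next to the mean helps show how many incidents each average is based on.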

## The case for using AI


When is using AI a good idea?

When is using AI a bad idea?

## Asking interesting questions

What are 5 interesting questions we can ask about our data?

How can we use AI to help us write Pandas to answer them?