
    
<img src="https://astanait.edu.kz/wp-content/uploads/2020/05/aitu-logo-3.png" alt="alt text" width="150" height="200" class="blog-image">
  

<h1 style="text-align:center;">Big Data in Law Enforcement (practice) </h1>

<h1 style="text-align:center;">Pandas, NumPy </h1>




### Part 1: Pandas: DataFrame and Series 

Pandas is a widely used data analysis library that provides pre-written code and functions for tackling a variety of data-related tasks.
It offers a rich set of tools for working with structured data, making it indispensable for data scientists, analysts, and engineers across various industries.

Here are some key aspects of Pandas:

**Data Structures:** Pandas provides two primary data structures: DataFrame and Series. A DataFrame is a flexible two-dimensional table with labeled rows and columns, while a Series is a labeled one-dimensional array, akin to a single column or row of data.

**Data Handling:** Pandas excels in managing data from diverse sources. It simplifies reading data from CSV files, Excel spreadsheets, SQL databases, and more. It also supports exporting data in these formats, enabling seamless data interchange.

**Data Integration:** Pandas enables you to combine data from different sources efficiently. With functions like merge and join, which work much like SQL join operations, you can create comprehensive datasets by merging multiple DataFrames.

**Efficient Indexing:** Pandas offers efficient indexing and selection mechanisms. You can quickly access specific rows and columns of your data, which is especially helpful when dealing with large datasets.

**Customization:** Pandas allows you to create custom data structures and manipulate data to suit your specific requirements. This extensibility is invaluable when you need to adapt Pandas to unique data processing tasks.

In summary, Pandas is a go-to tool for managing and analyzing data in various formats, making it an essential asset for anyone involved in data-related work.
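The two data structures and the SQL-like merge described above can be sketched as follows. This is a minimal illustration, not part of the course datasets: the tables `students` and `grades` and their values are made up for the example.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only
students = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Rose", "John", "Jane"]})
grades = pd.DataFrame({"ID": [1, 2, 4], "Grade": [95, 80, 75]})

# A Series is a single labeled column; a DataFrame is a table of such columns
names = students["Name"]

# Inner join on the shared "ID" column, similar to SQL's INNER JOIN:
# only IDs present in both tables (1 and 2) are kept
merged = pd.merge(students, grades, on="ID", how="inner")
print(merged)
```

Changing `how` to `"left"`, `"right"` or `"outer"` keeps unmatched rows from one or both tables, with missing values filled in as NaN.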


In [5]:
#install Pandas
!pip install pandas
#import the Pandas Library
import pandas as pd



After the import command, we have access to a large number of pre-built classes and functions. This assumes the library is installed.

**Let's create a DataFrame out of a dictionary.**


In [4]:
#Define a dictionary 'students'

students = {'ID': [1, 2, 3, 4], 'Name': ['Rose','John', 'Jane', 'Mary'],  'Group': ['CS_23', 'CS_25', 'CS_20', 'CS_13'], 
      'Grade':[95, 80, 90, 75]}

#casting the dictionary to a DataFrame
df = pd.DataFrame(students)

#display the result df
df

| | ID | Name | Group | Grade |
|---|---|---|---|---|
| 0 | 1 | Rose | CS_23 | 95 |
| 1 | 2 | John | CS_25 | 80 |
| 2 | 3 | Jane | CS_20 | 90 |
| 3 | 4 | Mary | CS_13 | 75 |


**Column Selection**

To select a column and keep the result as a DataFrame, use double brackets. To get the column as a Series, use single brackets:

In [16]:
#Retrieving the "Name" column and assigning it to a variable name
name = df[['Name']]
name


| | Name |
|---|---|
| 0 | Rose |
| 1 | John |
| 2 | Jane |
| 3 | Mary |


Using the <code>type()</code> function, you can check the type of the variable.


In [8]:
#check the type of Name
type(name)

pandas.core.frame.DataFrame

As shown in the output the type of the variable is a DataFrame object.
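For comparison, here is a short sketch of the same selection with single brackets, which returns a Series instead. The small DataFrame is rebuilt here so the snippet runs on its own; the variable names `name_series` and `name_frame` are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4], "Name": ["Rose", "John", "Jane", "Mary"]})

name_series = df["Name"]    # single brackets -> Series (one-dimensional)
name_frame = df[["Name"]]   # double brackets -> DataFrame with one column

print(type(name_series).__name__)            # Series
print(type(name_frame).__name__)             # DataFrame
print(name_series.shape, name_frame.shape)   # (4,) (4, 1)
```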


**Accessing multiple columns**

We can retrieve the data for the <code>ID</code>, <code>Group</code> and <code>Grade</code> columns by passing a list of column names:


In [10]:
#Retrieving data for the ID, Group and Grade columns and assigning it to a new variable x

x = df[['ID','Group','Grade']]
x

| | ID | Group | Grade |
|---|---|---|---|
| 0 | 1 | CS_23 | 95 |
| 1 | 2 | CS_25 | 80 |
| 2 | 3 | CS_20 | 90 |
| 3 | 4 | CS_13 | 75 |


### Practice Exercise 1: 

##### Problem 1: Create the following dataframe from a dictionary and name it ***students***:

![Снимок экрана 2023-10-10 215914.png](attachment:953c078c-7bfa-4de9-b39f-dd1a128f37f9.png)

In [None]:
#write your code here


##### Problem 2: Retrieve the Marks column and assign it to a variable grade


In [None]:
#write your code here


##### Problem 3: Retrieve the Country and Course columns and assign them to a variable info


In [None]:
#write your code here


*** 

**Using <code>loc</code> and <code>iloc</code>**

In this exercise, we'll explore <code>loc</code> and <code>iloc</code> for data selection. Both are indexers used with square brackets rather than functions called with parentheses.

**loc**:
<code>loc</code> is a label-based selection method: you specify the label (name) of the row or column you want to select. When given a range, it includes the last element of that range.

Simple syntax:

 * loc[row_label, column_label]

**iloc**:
<code>iloc</code>, on the other hand, is integer-position-based: you provide integer positions to select specific rows or columns. Importantly, it does not include the last element of the range passed.

Simple syntax:
   
 * iloc[row_index, column_index]



In [20]:
# You can access the value in column using the name

df.loc[0, 'Grade']

95

In [21]:
# To access the value on the first row and the first column you can use the following code

df.iloc[0, 0]

1

In [18]:
# To access the value on the first row and the third column

df.iloc[0,2]

'CS_23'

**Slicing**

In pandas, you can perform data selection by slicing using the **[ ]** operator. Slicing allows you to choose a specific set of rows and/or columns from a DataFrame.

To slice out a set of rows, you can use the following syntax: ***data[start:stop]***. Here, **"start"** represents the index from which to begin selecting, and **"stop"** represents the index one step beyond the row you wish to include in the selection. Slicing can be done using either index positions or column labels.

It's important to note that when slicing with pandas, the "start" bound is inclusive in the output. For example, if you want to select rows 0, 1, and 2, your code would appear as follows: **df.iloc[0:3]**. This signifies that you are instructing Python to start at index 0 and include rows 0, 1, and 2 up to, but not including, index 3.

Additionally, make sure that the labels you use for selection are present in the DataFrame; requesting a row or column label that does not exist raises a KeyError.
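A minimal sketch of that failure mode for a single-value lookup is shown here. The helper `safe_lookup` is hypothetical, written only for this illustration, and the two-row DataFrame is a cut-down version of the one used earlier.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Rose", "John"], "Grade": [95, 80]})

def safe_lookup(frame, row_label, column_label):
    """Return the value at (row_label, column_label), or None if a label is missing."""
    try:
        return frame.loc[row_label, column_label]
    except KeyError:
        # Raised when the row label or the column label is not in the DataFrame
        return None

print(safe_lookup(df, 0, "Grade"))   # both labels exist
print(safe_lookup(df, 0, "Age"))     # "Age" is not a column, so None
print(safe_lookup(df, 5, "Grade"))   # row label 5 does not exist, so None
```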

It's worth highlighting that indexing by labels, which is done with <code>loc</code>, differs from indexing by integer position, which is done with <code>iloc</code>. With <code>loc</code>, both the start and stop bounds are inclusive. You can pass integers to <code>loc</code>, but they are interpreted as index labels rather than positions.

For instance, selecting rows 1:4 using **loc** yields a different result than selecting rows 1:4 using **iloc**. The former includes the rows with index labels 1, 2, 3, and 4, while the latter includes only the rows at positions 1, 2, and 3.
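The difference can be checked with a small sketch. The six-row DataFrame and the shifted index labels (10 to 15) are made up for this example so that labels and positions can be told apart.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})  # index labels 0..5

by_label = df.loc[1:4]      # labels 1, 2, 3, 4 (stop bound included)
by_position = df.iloc[1:4]  # positions 1, 2, 3 (stop bound excluded)

print(len(by_label), len(by_position))  # 4 3

# With a non-default index, labels and positions no longer coincide
shifted = df.set_index(pd.Index([10, 11, 12, 13, 14, 15]))
print(shifted.loc[11:12, "value"].tolist())  # rows labeled 11 and 12
print(shifted.iloc[1:3]["value"].tolist())   # rows at positions 1 and 2
```

Both of the last two lines select the same rows, but the first addresses them by label and the second by position.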

In [25]:
# let us do the slicing using old dataframe df

df.iloc[0:2, 0:2]

| | ID | Name |
|---|---|---|
| 0 | 1 | Rose |
| 1 | 2 | John |


In [29]:
#let us do the slicing using loc on the same dataframe df, whose index labels are 0, 1, 2, 3
df.loc[0:1,'ID':'Group']

| | ID | Name | Group |
|---|---|---|---|
| 0 | 1 | Rose | CS_23 |
| 1 | 2 | John | CS_25 |


### Practice Exercise 2: 


Use <code>loc</code> to get the Age of Jane in the ***students*** dataframe you created in the previous exercise.


In [None]:
#write your code here


Use <code>iloc</code> to get the Course of Rick in the ***students*** dataframe you created in the previous exercise.


In [None]:
#write your code here


Using <code>loc</code>, slice the **students** dataframe you created to retrieve the Student, Age and Country columns for the rows with index labels 2 and 3.



In [None]:
# Write your code below and press Shift+Enter to execute


***

### Part 2: Importing data with Pandas

Pandas provides a convenient way to handle and manipulate data through the use of data frames. Here, we'll walk through the steps involved in converting data from a CSV file into a data frame.

In [1]:
# Dependencies needed to read Excel files

!pip install xlrd

!pip install openpyxl 


We are going to import the **Law enforcement surveillance technologies** dataset, which contains information about which US law enforcement agencies are using surveillance technologies. More information about the data is available here: https://data.world/publicsafety/law-enforcement-surveillance-technologies

The variable **csv_path** stores the path to the **.csv** file and is passed as an argument to the **read_csv** function. The result is stored in a new object, **df**, a common short name for a variable that refers to a Pandas dataframe.


In [9]:
# Import and Read data from CSV file

# Adjust this path to where the file is stored on your machine
csv_path = 'C:/Users/abatz/Desktop/BigDataInLaw/week2/Atlas of Surveillance-20200813'
df = pd.read_csv(csv_path)

To inspect the first five rows of a dataframe, use the head() method.

In [8]:
df.head()

Unnamed: 0,AOSNUMBER,City,County,State,Agency,Type of LEA,Summary,Type of Juris,Technology,Vendor,...,Link 1 Type,Link 1 Date,Link 2,Link 2 Source,Link 2 Type,Link 2 Date,Link 3,Link 3 Source,Link 3 Type,Link 3 Date
0,AOS0001,Woodstock,Shenandoah County,VA,Woodstock Police Department,Police,The Woodstock Police Department has used body-...,Municipal,Body-worn Cameras,VML Insurance Programs,...,,12/27/2012,,,,,,,,
1,AOS0002,Rockford,"Ogle County, Winnebago County",IL,Rockford Police Department,Police,"The Rockford Police Department spent $310,000 ...",Municipal,Gunshot Detection,ShotSpotter,...,,10/06/2018,,,,,,,,
2,AOS0003,Atlanta,"DeKalb County, Fulton County",GA,Atlanta Police Department,Police,The Atlanta Police Department began using Shot...,Municipal,Gunshot Detection,ShotSpotter,...,,11/16/2018,https://www.atlantapd.org/Home/ShowDocument?id...,Atlanta Police Department,,09/15/2018,,,,
3,AOS0004,Peoria,Peoria County,IL,Peoria Police Department,Police,The Peoria Police Department began using ShotS...,Municipal,Gunshot Detection,ShotSpotter,...,,12/13/2019,https://www.shotspotter.com/news/peoria-police...,Shotspotter,,03/01/2015,https://web.archive.org/web/20150317034225/htt...,Associated Press,,03/01/2015
4,AOS0005,Goldsboro,Wayne County,NC,Goldsboro Police Department,Police,The Goldsboro Police Department uses ShotSpott...,Municipal,Gunshot Detection,ShotSpotter,...,,10/23/2017,https://www.facebook.com/goldsborodailynews/po...,Goldsboro Police Department Facebook Page,,4/27/2020,,,,
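If the file is not available locally, the same read_csv and head() steps can be tried on in-memory text. This is only a sketch: `io.StringIO` stands in for the file on disk, and `csv_text` is a three-column excerpt built from the rows shown in the output above.

```python
import io
import pandas as pd

# Sample text standing in for a .csv file on disk (excerpt of the output above)
csv_text = """City,State,Technology
Woodstock,VA,Body-worn Cameras
Rockford,IL,Gunshot Detection
Atlanta,GA,Gunshot Detection
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (3, 3): three rows, three columns
print(sample.head(2))  # head(n) returns the first n rows instead of the default five
```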


We can also import the dataset directly from data.world using its URL:

In [12]:
df2 = pd.read_csv('https://query.data.world/s/nx3t7onwwhapqyj7sibky3gnexutup?dws=00000')
df2.head()

Unnamed: 0,AOSNUMBER,City,County,State,Agency,Type of LEA,Summary,Type of Juris,Technology,Vendor,...,Link 1 Type,Link 1 Date,Link 2,Link 2 Source,Link 2 Type,Link 2 Date,Link 3,Link 3 Source,Link 3 Type,Link 3 Date
0,AOS0001,Woodstock,Shenandoah County,VA,Woodstock Police Department,Police,The Woodstock Police Department has used body-...,Municipal,Body-worn Cameras,VML Insurance Programs,...,,12/27/2012,,,,,,,,
1,AOS0002,Rockford,"Ogle County, Winnebago County",IL,Rockford Police Department,Police,"The Rockford Police Department spent $310,000 ...",Municipal,Gunshot Detection,ShotSpotter,...,,10/06/2018,,,,,,,,
2,AOS0003,Atlanta,"DeKalb County, Fulton County",GA,Atlanta Police Department,Police,The Atlanta Police Department began using Shot...,Municipal,Gunshot Detection,ShotSpotter,...,,11/16/2018,https://www.atlantapd.org/Home/ShowDocument?id...,Atlanta Police Department,,09/15/2018,,,,
3,AOS0004,Peoria,Peoria County,IL,Peoria Police Department,Police,The Peoria Police Department began using ShotS...,Municipal,Gunshot Detection,ShotSpotter,...,,12/13/2019,https://www.shotspotter.com/news/peoria-police...,Shotspotter,,03/01/2015,https://web.archive.org/web/20150317034225/htt...,Associated Press,,03/01/2015
4,AOS0005,Goldsboro,Wayne County,NC,Goldsboro Police Department,Police,The Goldsboro Police Department uses ShotSpott...,Municipal,Gunshot Detection,ShotSpotter,...,,10/23/2017,https://www.facebook.com/goldsborodailynews/po...,Goldsboro Police Department Facebook Page,,4/27/2020,,,,


To read an Excel file, we can use the **read_excel** function. We are going to read the **Gun Deaths in US from 1999 to 2019** dataset.

In [18]:
xlsx_path = "C:/Users/abatz/Desktop/BigDataInLaw/week2/gun_deaths_us_1999_2019.xlsx"
df3 = pd.read_excel(xlsx_path)
df3.head()

Unnamed: 0.1,Unnamed: 0,Year,County,County Code,State,State_Name,State Code,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Age Adjusted Rate,Age Adjusted Rate Lower 95% Confidence Interval,Age Adjusted Rate Upper 95% Confidence Interval
0,0,1999,Baldwin County,1003,AL,Alabama,1,22,137555,15.99,10.02,24.21,16.28,10.2,24.64
1,1,1999,Calhoun County,1015,AL,Alabama,1,29,114910,25.24,16.9,36.24,25.21,16.89,36.21
2,2,1999,Chambers County,1017,AL,Alabama,1,10,36527,,13.13,50.35,,12.91,49.51
3,3,1999,Colbert County,1033,AL,Alabama,1,14,54715,,13.99,42.93,,13.54,41.57
4,4,1999,Dallas County,1047,AL,Alabama,1,11,46722,,11.75,42.13,,12.15,43.54


### Practice Exercise 3: 


Use a variable **c** to store the City column from the **Law enforcement surveillance technologies** dataset as a dataframe.

In [None]:
# Write your code below


Assign to the variable **z** a dataframe that consists of the columns **State** and **Deaths** from the **Gun Deaths in US from 1999 to 2019** dataset.


In [None]:
# Write your code below


Access the value in the 1st row and the 3rd column of the **Law enforcement surveillance technologies** dataset:


In [None]:
# Write your code below


### Part 3: NumPy

NumPy is a Python library designed for tasks involving arrays, linear algebra, Fourier transforms, and matrices. A NumPy array shares similarities with a list, but it offers significant advantages. NumPy, short for Numerical Python, is an open-source project. The primary array object in NumPy is called **ndarray**, and it comes equipped with an array of functions that simplify working with it.

In data science, arrays find extensive use, particularly when efficiency and resource management are critical.

Typically, NumPy is imported under the alias np. NumPy arrays have fixed sizes, and all elements within an array are of the same data type. You can convert a regular list into a NumPy array after importing the numpy library.
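As a brief sketch of the single-data-type rule and of one advantage over plain lists, the snippet below converts two lists and applies element-wise arithmetic. The variable names `ints`, `mixed` and `doubled` are chosen for the example.

```python
import numpy as np

ints = np.array([5, 1, 7, 3, 4])
mixed = np.array([1, 2.5, 3])   # integers are upcast so every element is a float

print(ints.dtype)    # an integer type; the exact width depends on the platform
print(mixed.dtype)   # float64

# Arithmetic is applied element by element, without an explicit loop
doubled = ints * 2
print(doubled.tolist())          # [10, 2, 14, 6, 8]

# The same operator on a plain list repeats the list instead
print([5, 1, 7] * 2)             # [5, 1, 7, 5, 1, 7]
```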

In [20]:
# import numpy library

import numpy as np 

We can convert a Python list to a NumPy array as follows:


In [27]:
# Create a numpy array

a = np.array([5, 1, 7, 3, 4])
a

array([5, 1, 7, 3, 4])

Similar to lists, you can access individual elements using square brackets.






In [28]:
a[0]

5

We can change the first element of the array to 10 as follows:

In [29]:
# Assign the first element to 10

a[0] = 10
a

array([10,  1,  7,  3,  4])

We can also slice the array:

In [31]:
# Slicing the numpy array
b = a[1:4]
b

array([1, 7, 3])
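One detail worth knowing when slicing: a basic slice of a NumPy array is a view that shares memory with the original, so writing to the slice also changes the source array. The sketch below uses the names `view` and `copy` for illustration and calls `.copy()` when an independent array is needed.

```python
import numpy as np

a = np.array([10, 1, 7, 3, 4])

view = a[1:4]          # a view: shares memory with a
copy = a[1:4].copy()   # an independent copy

view[0] = 100          # also changes a[1]
copy[1] = -1           # does not affect a

print(a.tolist())      # [10, 100, 7, 3, 4]
print(copy.tolist())   # [1, -1, 3]
```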

**Attributes**

The attribute **size** shows the number of elements in the array:


In [32]:
a.size

5

The attribute **ndim** represents the number of array dimensions, or the rank of the array. In our case it is one.

In [34]:
a.ndim

1

The attribute **shape** indicates the size of the array in each dimension:

In [35]:
a.shape

(5,)
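The same three attributes are easier to tell apart on a two-dimensional array. The following is a small sketch with a made-up 2 x 3 array `m`.

```python
import numpy as np

# A 2 x 3 array: two rows, three columns
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.size)    # 6, the total number of elements
print(m.ndim)    # 2, the number of dimensions
print(m.shape)   # (2, 3), the size along each dimension
print(m[1, 2])   # 6, the element in the second row, third column
```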