# Data Mining - Lab - 2

# NumPy & Data Exploration with Pandas

-------------------------------------------------------------------------------
## NumPy

1) NumPy (Numerical Python) is an open-source Python library for numerical and scientific computing.<br>
2) It provides large, multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on them efficiently.<br>
3) NumPy's core is implemented in optimized C, so array operations are typically much faster than the same operations on regular Python lists.<br>
4) It serves as the foundation for many other Python libraries in data science and machine learning, such as pandas, TensorFlow, and scikit-learn.<br>
5) Features like broadcasting, vectorization, and integration with C/C++ code allow for cleaner and faster numerical code.<br>
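
To make the terms vectorization and broadcasting from the list above concrete, here is a minimal sketch. The names `prices` and `scale` and their values are made up for illustration only.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
prices = np.array([10.0, 20.0, 30.0])

# Vectorization: one expression operates on every element, no Python loop.
doubled = prices * 2

# Broadcasting: a (2, 1) column combines with a (3,) row to give a (2, 3) result.
scale = np.array([[1.0], [0.5]])
scaled = scale * prices

print(doubled)
print(scaled.shape)
```

Each row of `scaled` is `prices` multiplied by one entry of `scale`, without either array being copied by hand.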



### Step 1. Import the NumPy library

In [3]:
import numpy as np



### Step 2. Create a 1D array of numbers

In [31]:
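# np.array with nested lists builds a 2D array
arr = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print(arr)
# np.arange(0,10) builds a 1D array of the integers 0 to 9 (the stop value is excluded)
arr = np.arange(0,10)
print(arr)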

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
[0 1 2 3 4 5 6 7 8 9]


### Step 3. Reshape 1D to 2D Array

In [None]:
# reshape needs exactly 2*5 = 10 elements and returns a new array, so assign the result
arr = arr.reshape(2,5)
print(arr)
# dtype gives the data type of the array's elements
print(arr.dtype)

[[0 1 2 3 4]
 [5 6 7 8 9]]
int64
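
One point worth checking by hand is that `reshape` leaves the original array untouched and hands back a reshaped one. The sketch below illustrates this and also shows `-1`, which lets NumPy infer one dimension; the variable names `reshaped` and `inferred` are only for this example.

```python
import numpy as np

arr = np.arange(0, 10)

# reshape returns a reshaped array and leaves arr itself unchanged.
reshaped = arr.reshape(2, 5)

# -1 asks NumPy to infer that dimension from the total number of elements.
inferred = arr.reshape(-1, 2)

print(arr.shape)       # (10,)
print(reshaped.shape)  # (2, 5)
print(inferred.shape)  # (5, 2)
```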


### Step 4. Create a Linspace array

In [40]:
# linspace returns evenly spaced numbers from start to end, both included
# the third parameter is how many numbers to generate (50 by default)
arr = np.linspace(1,2)
print(arr)
arr = np.linspace(1,2,5)
print(arr)

[1.         1.02040816 1.04081633 1.06122449 1.08163265 1.10204082
 1.12244898 1.14285714 1.16326531 1.18367347 1.20408163 1.2244898
 1.24489796 1.26530612 1.28571429 1.30612245 1.32653061 1.34693878
 1.36734694 1.3877551  1.40816327 1.42857143 1.44897959 1.46938776
 1.48979592 1.51020408 1.53061224 1.55102041 1.57142857 1.59183673
 1.6122449  1.63265306 1.65306122 1.67346939 1.69387755 1.71428571
 1.73469388 1.75510204 1.7755102  1.79591837 1.81632653 1.83673469
 1.85714286 1.87755102 1.89795918 1.91836735 1.93877551 1.95918367
 1.97959184 2.        ]
[1.   1.25 1.5  1.75 2.  ]
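
For comparison, the following sketch puts `linspace` next to `arange`: the first fixes the number of points, the second fixes the step. The variable names are illustrative.

```python
import numpy as np

# linspace: fix the number of points; the end value is included by default.
points = np.linspace(1, 2, 5)

# endpoint=False drops the end value and adjusts the spacing accordingly.
open_points = np.linspace(1, 2, 5, endpoint=False)

# arange: fix the step; the stop value is excluded.
steps = np.arange(1, 2, 0.25)

print(points)       # [1.   1.25 1.5  1.75 2.  ]
print(open_points)  # [1.  1.2 1.4 1.6 1.8]
print(steps)        # [1.   1.25 1.5  1.75]
```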


### Step 5. Create a Random Numbered Array

In [None]:
# rand(n, m) generates uniform random floats in [0, 1) with shape (n, m)
arr = np.random.rand(3,2)
print(arr)

[[0.56321607 0.42745388]
 [0.2726167  0.37903316]
 [0.73672808 0.01537604]]


### Step 6. Create a Random Integer Array

In [None]:
# the 3rd parameter is the size (how many numbers to generate); the upper bound is excluded
arr = np.random.randint(1,3,5)
print(arr)
arr = np.random.randint(1,30,5)
print(arr)

[2 2 2 2 1]
[28 24 14 29 27]
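
The random outputs in Steps 5 and 6 change on every run. If repeatable results are wanted, one option is a seeded `Generator` from `np.random.default_rng`. This is a sketch of an alternative to the calls used in the lab, and the seed 42 is an arbitrary choice.

```python
import numpy as np

# A seeded Generator gives the same sequence on every run (seed 42 is arbitrary).
rng = np.random.default_rng(42)

floats = rng.random((3, 2))          # uniform floats in [0, 1), shape (3, 2)
ints = rng.integers(1, 30, size=5)   # integers in [1, 30), like randint(1, 30, 5)

# Re-creating the generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(42)
same_floats = rng_again.random((3, 2))

print(np.array_equal(floats, same_floats))  # True
```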


### Step 7. Create a 2D Array and get Max,Min,ArgMax,ArgMin

In [69]:
arr = np.random.randint(0,100,12).reshape(4,3)
print(arr)
print(arr.max())
print(arr.min())
# axis=0: row index of the maximum in each column
print(arr.argmax(axis=0))
# axis=1: column index of the maximum in each row
print(arr.argmax(axis=1))

[[20 20 94]
 [84 63 68]
 [39 44 80]
 [39 25 24]]
94
20
[1 1 0]
[2 0 2 0]
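
The step title also mentions ArgMin. As a sketch, `argmin` works the same way as `argmax`, and `np.unravel_index` turns a flat position into a (row, column) pair. The array below hard-codes the values printed above so the results are repeatable.

```python
import numpy as np

# The same values as the array printed above, hard-coded so the result is repeatable.
arr = np.array([[20, 20, 94],
                [84, 63, 68],
                [39, 44, 80],
                [39, 25, 24]])

print(arr.argmin(axis=0))  # row index of the minimum in each column: [0 0 3]
print(arr.argmin(axis=1))  # column index of the minimum in each row: [0 1 0 2]

# Without axis, argmax works on the flattened array; unravel_index maps
# that flat position back to (row, column).
flat_pos = arr.argmax()
row, col = np.unravel_index(flat_pos, arr.shape)
print(flat_pos, row, col)  # 2 0 2
```

When a minimum or maximum occurs more than once, `argmin` and `argmax` report the first position.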


### Step 8. Indexing in 1D Array

In [74]:
arr = np.random.randint(0,100,10)
print(arr)
print(arr[1])
print(arr[1:4])

[10 52 97 40 84 30 32 21  9 57]
52
[52 97 40]


### Step 9. Indexing in 2D Array

In [87]:
arr = np.arange(0,10).reshape(5,2)
print(f"array is {arr}")
print(f"array[3] is : {arr[3]}")
print(f"array[3][1] is : {arr[3][1]}")
print(f"Slicing from 1:4 {arr[1:4]}")

array is [[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
array[3] is : [6 7]
array[3][1] is : 7
Slicing from 1:4 [[2 3]
 [4 5]
 [6 7]]
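
A 2D array can also be indexed with a single `[row, column]` pair, which is the more common NumPy style. The sketch below shows that form, a column slice, and the fact that basic slices are views on the original data. The names `single`, `second_col` and `view` are illustrative.

```python
import numpy as np

arr = np.arange(0, 10).reshape(5, 2)

# arr[3, 1] selects row 3, column 1 in one step (same value as arr[3][1]).
single = arr[3, 1]

# A slice on the first axis plus an index on the second picks one column
# from rows 1 to 3.
second_col = arr[1:4, 1]

# Basic slices are views: writing through the slice changes the original.
view = arr[0:2]
view[0, 0] = 99

print(single)      # 7
print(second_col)  # [3 5 7]
print(arr[0, 0])   # 99
```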


### Step 10. Conditional Selection

In [99]:
arr = np.random.randint(20,40,10)
print(f"Array : {arr}")
# boolean mask: selects the elements greater than 30
print(f"elements greater than 30: {arr[arr>30]}")
# np.where returns the matching indexes; indexing arr with them gives the elements again
print(f"elements greater than 32: {arr[np.where(arr>32)]}")

Array : [37 36 25 24 26 36 31 29 38 20]
elements greater than 30: [37 36 36 31 38]
elements greater than 32: [37 36 36 38]
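
To separate the matching elements from their positions, the sketch below keeps the mask, the values and the indexes in separate variables, and also combines two conditions. The array values are copied from the output above.

```python
import numpy as np

# Values copied from the array printed above.
arr = np.array([37, 36, 25, 24, 26, 36, 31, 29, 38, 20])

mask = arr > 32                 # boolean array, one flag per element
values = arr[mask]              # the matching elements
indices = np.where(mask)[0]     # the positions of the matching elements

# Conditions combine with & and |; each condition needs its own parentheses.
between = arr[(arr > 25) & (arr < 37)]

print(values)   # [37 36 36 38]
print(indices)  # [0 1 5 8]
print(between)  # [36 26 36 31 29]
```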


### 🔥You did it! 10 exercises down — you're on fire! 🔥

## Pandas



### Step 1. Import the necessary libraries

In [100]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users and use the 'user_id' as index

In [108]:
users = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user",sep="|",index_col="user_id")
users

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| ... | ... | ... | ... | ... |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |
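
The effect of `sep` and `index_col` can be tried offline on a small in-memory sample. This is a sketch: the three rows are copied from the output above, and the `dtype` argument is an optional addition (not used in the lab) that guarantees zip codes are read as text, so leading zeros survive.

```python
import io
import pandas as pd

# A tiny pipe-separated sample in the same layout as u.user.
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "940|32|M|administrator|02215\n"
)

# sep="|" sets the field delimiter; index_col makes user_id the row index.
# dtype={"zip_code": str} keeps leading zeros such as "02215".
demo = pd.read_csv(sample, sep="|", index_col="user_id", dtype={"zip_code": str})

print(demo.loc[940, "zip_code"])  # 02215
print(demo.index.name)            # user_id
```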


### Step 4. See the first 25 entries

In [105]:
users.head(25)

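| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| 6 | 42 | M | executive | 98101 |
| 7 | 57 | M | administrator | 91344 |
| 8 | 36 | M | administrator | 05201 |
| 9 | 29 | M | student | 01002 |
| 10 | 53 | M | lawyer | 90703 |
| ... | ... | ... | ... | ... |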


### Step 5. See the last 10 entries

In [106]:
users.tail(10)

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 934 | 61 | M | engineer | 22902 |
| 935 | 42 | M | doctor | 66221 |
| 936 | 24 | M | other | 32789 |
| 937 | 48 | M | educator | 98072 |
| 938 | 38 | F | technician | 55038 |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |


### Step 6. What is the number of observations in the dataset?

In [122]:
users.shape[0]

943

### Step 7. What is the number of columns in the dataset?

In [124]:
users.shape[1]

4

### Step 8. Print the name of all the columns.

In [115]:
users.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

### Step 9. How is the dataset indexed?

In [116]:
users.index

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
      dtype='int64', name='user_id', length=943)

### Step 10. What is the data type of each column?

In [129]:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object
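
Steps 6 to 10 each read one structural property of the DataFrame. As a sketch on a small hand-made frame (the three rows are taken from the outputs above, the name `demo` is illustrative), the same properties can be unpacked together, and `select_dtypes` splits the columns by type.

```python
import pandas as pd

# A small stand-in for users, with user_id as a named index.
demo = pd.DataFrame(
    {
        "age": [24, 53, 23],
        "gender": ["M", "F", "M"],
        "zip_code": ["85711", "94043", "32067"],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = demo.shape

# select_dtypes filters columns by data type.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
text_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)  # 3 3
print(numeric_cols)    # ['age']
print(text_cols)       # ['gender', 'zip_code']
```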

### Step 11. Print only the occupation column

In [125]:
users.occupation

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object

### Step 12. How many different occupations are in this dataset?

In [138]:
# both give the number of unique occupations
users.occupation.unique().shape[0]
users.occupation.nunique()
# the unique occupation names (only this last expression is displayed)
users.occupation.unique()

array(['technician', 'other', 'writer', 'executive', 'administrator',
       'student', 'lawyer', 'educator', 'scientist', 'entertainment',
       'programmer', 'librarian', 'homemaker', 'artist', 'engineer',
       'marketing', 'none', 'healthcare', 'retired', 'salesman', 'doctor'],
      dtype=object)

### Step 13. What is the most frequent occupation?

In [148]:
# the most frequent occupation together with its count
users.occupation.value_counts().head(1)
# the name of the most frequent occupation
users.occupation.value_counts().idxmax()

'student'
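
`idxmax` returns a single label, which hides ties. The sketch below uses made-up data with a tie to show that `mode` returns every label sharing the highest count.

```python
import pandas as pd

# Hypothetical data with a tie between "student" and "other".
occupation = pd.Series(["student", "other", "student", "writer", "other"])

counts = occupation.value_counts()   # sorted by count, descending
top_count = counts.iloc[0]           # the highest count

# idxmax reports only one label; mode returns every label tied for the
# highest count, in sorted order.
one_label = counts.idxmax()
tied = occupation.mode().tolist()

print(top_count)  # 2
print(tied)       # ['other', 'student']
```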

### Step 14. Summarize the DataFrame.

In [149]:
users.describe()

| | age |
|---|---|
| count | 943.0 |
| mean | 34.051962 |
| std | 12.19274 |
| min | 7.0 |
| 25% | 25.0 |
| 50% | 31.0 |
| 75% | 43.0 |
| max | 73.0 |


### Step 15. Summarize all the columns

In [152]:
users.describe(include="all")

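| | age | gender | occupation | zip_code |
|---|---|---|---|---|
| count | 943.0 | 943 | 943 | 943 |
| unique | NaN | 2 | 21 | 795 |
| top | NaN | M | student | 55414 |
| freq | NaN | 670 | 196 | 9 |
| mean | 34.051962 | NaN | NaN | NaN |
| std | 12.19274 | NaN | NaN | NaN |
| min | 7.0 | NaN | NaN | NaN |
| 25% | 25.0 | NaN | NaN | NaN |
| 50% | 31.0 | NaN | NaN | NaN |
| 75% | 43.0 | NaN | NaN | NaN |
| max | 73.0 | NaN | NaN | NaN |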
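
The difference between the two `describe` calls can be seen on a small frame. This is a sketch with made-up rows; `exclude=[np.number]` is the form shown in the pandas documentation for summarizing only the non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, chosen only for illustration.
demo = pd.DataFrame({
    "age": [24, 53, 23, 24],
    "occupation": ["technician", "other", "writer", "technician"],
})

numeric_summary = demo.describe()                    # numeric columns only (default)
text_summary = demo.describe(exclude=[np.number])    # non-numeric columns only
full_summary = demo.describe(include="all")          # every column; inapplicable cells are NaN

print(list(numeric_summary.columns))           # ['age']
print(text_summary.loc["top", "occupation"])   # technician
print(full_summary.loc["mean", "age"])         # 31.0
```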


### Step 16. Summarize only the occupation column

In [153]:
users.occupation.describe()

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

### Step 17. What is the mean age of users?

In [154]:
users.age.mean()

np.float64(34.05196182396607)

### Step 18. What is the age with least occurrence?

In [163]:
users.age.value_counts().tail()

age
7     1
11    1
66    1
10    1
73    1
Name: count, dtype: int64
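
`tail()` shows the last five counts, which is enough here because all five are 1. A sketch of a more general approach is to keep every age whose count equals the minimum; the ages below are made up for illustration.

```python
import pandas as pd

# Hypothetical ages; 7 and 73 each appear once.
age = pd.Series([24, 24, 30, 30, 30, 7, 73])

counts = age.value_counts()

# Keep every age whose count equals the smallest count, then sort the labels.
rarest = counts[counts == counts.min()].index.sort_values().tolist()

print(rarest)  # [7, 73]
```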

### You're not just learning, you're mastering it. Keep aiming higher! 🚀