- Create a `DataFrame` named `df` with 6 rows and the following columns:
    - `A`: random floating point values
    - `B`: randomly assigned categorical values from `["test", "train"]`
    - `C`: random integer values, constructed from a `numpy.array`
    - `D`: random integer values, constructed from a `Series`
    - `E`: monthly dates "2021-01-01", "2021-02-01", "2021-03-01", ...
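One way to build such a frame (the seed and the value ranges are arbitrary choices, fixed here only so the result is reproducible):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

df = pd.DataFrame(
    {
        "A": rng.random(6),                                         # random floats
        "B": pd.Categorical(rng.choice(["test", "train"], 6)),      # categorical labels
        "C": np.array(rng.integers(0, 10, size=6), dtype="int32"),  # from a numpy.array
        "D": pd.Series(rng.integers(0, 10, size=6), dtype="int32"), # from a Series
        "E": pd.date_range("2021-01-01", periods=6, freq="MS"),     # month-start dates
    }
)
print(df)
```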
- Convert the numeric columns into a `numpy.matrix` and compute the row sums.
- Sort `df` by column `C`.
- Filter `df` for entries for which `B` has the value `train` and `C` has values greater than 0.
- Change the value in the 4th column and 2nd row to 10.
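A sketch of these four steps, assuming the `df` from the first exercise (rebuilt here in minimal form so the snippet runs standalone):

```python
import numpy as np
import pandas as pd

# Minimal stand-in for the df built in the first exercise.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "A": rng.random(6),
        "B": pd.Categorical(["test", "train"] * 3),
        "C": np.array(rng.integers(-5, 5, size=6), dtype="int32"),
        "D": pd.Series(rng.integers(-5, 5, size=6), dtype="int32"),
    }
)

# Numeric columns -> numpy.matrix, then row sums along axis 1.
# (numpy.matrix is discouraged in modern NumPy; a plain ndarray
# with .sum(axis=1) gives the same numbers.)
m = np.asmatrix(df.select_dtypes("number").to_numpy())
row_sums = m.sum(axis=1)

# Sort by column C.
df_sorted = df.sort_values("C")

# Filter: B == "train" and C > 0.
filtered = df[(df["B"] == "train") & (df["C"] > 0)]

# 4th column, 2nd row (1-based) -> positional index [1, 3].
df.iloc[1, 3] = 10
```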
- Create a column `F` where half the values are `NaN`.
- Deal with missing values in two different ways:
    - remove entries with missing data
    - fill missing values with 0
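A sketch of both strategies, again on a minimal stand-in `df`; placing the `NaN`s at every other position is an arbitrary choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(6, dtype=float), "B": ["test", "train"] * 3})

# Column F: half the values are NaN (every other entry here).
df["F"] = [1.0, np.nan, 3.0, np.nan, 5.0, np.nan]

dropped = df.dropna()   # 1) remove entries (rows) with missing data
filled = df.fillna(0)   # 2) fill missing values with 0
```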
- Convert column `A` into a cumulative sum.
- Subtract column `A` from column `B`.
- Plot the numeric columns as a line plot, ensuring that the plot has proper labels.
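A sketch of the cumulative sum and the labelled plot (note that subtracting `A` from `B` only works if `B` holds numeric values; the non-interactive Agg backend is used here so the script also runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this when working interactively
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.default_rng(0).random(6),
                   "C": np.arange(6)})

df["A"] = df["A"].cumsum()  # replace A with its cumulative sum

ax = df.plot(title="Numeric columns")  # line plot of the numeric columns
ax.set_xlabel("row index")
ax.set_ylabel("value")
ax.figure.savefig("columns.png")
```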
- Compute the mean values of each column for the groups `train` and `test`.
- Convert the following `DataFrame` from `a` into `b` (long to wide). Additionally, convert from `b` into `a` (wide to long).

```python
a = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5, 6], "group": ["a", "a", "a", "b", "b", "b"]}
)
b = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6]}
)
```
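A sketch of the group means and of both reshapes; the `row` helper column is introduced here only to give `pivot` an index to align on:

```python
import pandas as pd

# Group means on a minimal df with a train/test column B.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                   "B": ["train", "test", "train", "test"]})
means = df.groupby("B").mean(numeric_only=True)

a = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5, 6], "group": ["a", "a", "a", "b", "b", "b"]}
)

# long -> wide: number the rows within each group, then pivot on that counter.
b = (
    a.assign(row=a.groupby("group").cumcount())
     .pivot(index="row", columns="group", values="value")
)

# wide -> long.
a2 = b.melt(var_name="group", value_name="value")
```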
- Load the `iris` dataset by

```python
import sklearn as sk
import sklearn.datasets

iris = sk.datasets.load_iris()
```
- Visualize the data matrix.
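One simple option is a heatmap of the 150×4 data matrix with the feature names on the x-axis; this is just one of several reasonable visualizations:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this when working interactively
import matplotlib.pyplot as plt
import sklearn.datasets

iris = sklearn.datasets.load_iris()

fig, ax = plt.subplots(figsize=(4, 8))
im = ax.imshow(iris.data, aspect="auto")          # one row per sample
ax.set_xticks(range(len(iris.feature_names)))
ax.set_xticklabels(iris.feature_names, rotation=45, ha="right")
ax.set_ylabel("sample")
fig.colorbar(im, ax=ax, label="cm")
fig.tight_layout()
fig.savefig("iris_matrix.png")
```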
- Train a random forest classifier to predict the target values and report its performance using an appropriate evaluation metric.
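A sketch using a held-out test split and accuracy as the metric (accuracy is a reasonable choice here because the iris classes are balanced; the split fraction and seeds are arbitrary):

```python
import sklearn.datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = sklearn.datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```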
- Explain how key parameters of the random forest classifier would influence its performance.
- Using `Biopython`, collect MEDLINE abstracts on "medulloblastoma" published in 2012. Save the data to disk as a CSV table.
- Import the CSV table and build a SQLite database.
- Obtain the PMID and title of publications for authors with the surname "Shih" from the database.
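The Entrez step needs network access and an e-mail address registered with NCBI, so it is only sketched in a comment (`Bio.Entrez.esearch`/`efetch` against `db="pubmed"` are the relevant Biopython calls). The CSV and SQLite parts below run offline on a small stand-in table whose PMIDs and titles are illustrative, not real records:

```python
import csv
import sqlite3

# Real data would come from Biopython, roughly:
#   from Bio import Entrez
#   Entrez.email = "you@example.org"   # required by NCBI
#   handle = Entrez.esearch(db="pubmed",
#                           term="medulloblastoma AND 2012[pdat]")
# followed by Entrez.efetch on the returned IDs.
rows = [
    ("10000001", "Example title one", "Shih"),   # illustrative stand-in records,
    ("10000002", "Example title two", "Doe"),    # not real publications
]

# Save to disk as a CSV table.
with open("abstracts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["pmid", "title", "surname"])
    writer.writerows(rows)

# Import the CSV and build a SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE abstracts (pmid TEXT, title TEXT, surname TEXT)")
with open("abstracts.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    con.executemany("INSERT INTO abstracts VALUES (?, ?, ?)", reader)

# Query PMID and title for the surname "Shih".
hits = con.execute(
    "SELECT pmid, title FROM abstracts WHERE surname = ?", ("Shih",)
).fetchall()
```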
- Implement a fully connected feedforward network from scratch using only the `numpy` library with the following layers: one input, two hidden, and one output. Neurons in the first hidden layer should use the sigmoid transfer function; those in the second hidden layer should use a ReLU transfer function. The network should be trained using backpropagation of errors.
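A minimal sketch of such a network, trained here on the XOR toy problem with a squared-error loss; the layer widths, learning rate, epoch count, and seed are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR toy problem: 2 inputs, 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights: input(2) -> hidden1(8, sigmoid) -> hidden2(8, ReLU) -> output(1).
W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (8, 8)), np.zeros(8)
W3, b3 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)

lr, losses = 0.5, []
for epoch in range(5000):
    # Forward pass.
    a1 = sigmoid(X @ W1 + b1)       # hidden layer 1: sigmoid
    z2 = a1 @ W2 + b2
    a2 = np.maximum(z2, 0.0)        # hidden layer 2: ReLU
    out = sigmoid(a2 @ W3 + b3)     # output layer

    losses.append(float(np.mean((out - y) ** 2)))

    # Backward pass: propagate the error layer by layer.
    d3 = (out - y) * out * (1 - out)   # output delta (sigmoid derivative)
    d2 = (d3 @ W3.T) * (z2 > 0)        # ReLU derivative is 1 where z2 > 0
    d1 = (d2 @ W2.T) * a1 * (1 - a1)   # sigmoid derivative

    # Gradient-descent updates, averaged over the batch.
    n = len(X)
    W3 -= lr * (a2.T @ d3) / n; b3 -= lr * d3.mean(axis=0)
    W2 -= lr * (a1.T @ d2) / n; b2 -= lr * d2.mean(axis=0)
    W1 -= lr * (X.T @ d1) / n;  b1 -= lr * d1.mean(axis=0)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```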